diff --git a/.github/configs/CONFIGS.md b/.github/configs/CONFIGS.md
index 9d3c24309..302605fbb 100644
--- a/.github/configs/CONFIGS.md
+++ b/.github/configs/CONFIGS.md
@@ -47,6 +47,58 @@ Notes:
 - No extra fields besides the ones listed may be specified, or else the benchmarks will fail to run.
 - Setting the fields above, particularly `ep` and `dp-attn`, only guarantees that the respective values will be passed as environment variables to the benchmark scripts! Actually using those environment variables is an implementation detail at the level of the benchmark Bash script.
+
+## Multi-node srt-slurm recipes
+
+Multi-node configs that dispatch via `srt-slurm` (i.e. `srtctl apply -f …`) reference their recipe as a first-class field on the search-space entry:
+
+```yaml
+search-space:
+- spec-decoding: "mtp"
+  conc-list: [1214]
+  recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch64_eplb0_mtp2.yaml"
+  prefill:
+    num-worker: 1
+    tp: 4
+    ep: 4
+    dp-attn: true
+  decode:
+    num-worker: 2
+    tp: 8
+    ep: 8
+    dp-attn: true
+```
+
+- `recipe` is a path **relative to `benchmarks/multi_node/srt-slurm-recipes/`** in this repo. The schema validator rejects entries whose recipe file does not exist on disk, so adding a new entry requires upstreaming the recipe yaml here first.
+- The path may carry an `:<override-name>[N]` suffix to select a named override section inside an sglang-style recipe yaml (e.g. `"dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_mtp_lowlat[0]"`). The launcher strips this suffix before reading the file but passes the full string to `srtctl`.
+- `recipe` is optional: multi-node entries that do *not* go through srt-slurm (e.g. dynamo-sglang aggregated topologies that drive their own bash) leave it unset.
+- Recipes live under `benchmarks/multi_node/srt-slurm-recipes/` organized as `<model>/<framework>/<gpu>-<precision>/<isl><osl>/<serving-mode>/<spec-mode>/<recipe>.yaml` — e.g. `dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch64_eplb0_mtp2.yaml`.
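The path and override-suffix rules above can be sketched roughly as follows. This is a hypothetical illustration only — `resolve_recipe`, its signature, and `RECIPE_ROOT` are invented names for this example, not the launcher's actual API:

```python
import os

# Repo-relative root the `recipe` field is resolved against (per the docs above).
RECIPE_ROOT = "benchmarks/multi_node/srt-slurm-recipes"


def resolve_recipe(recipe: str, root: str = RECIPE_ROOT) -> tuple[str, str]:
    """Return (absolute path for CONFIG_FILE, full string forwarded to srtctl).

    An optional ':<override-name>[N]' suffix is stripped only for the on-disk
    existence check; srtctl still receives the untouched string.
    """
    path_part = recipe.split(":", 1)[0]  # drop e.g. ':zip_override_mtp_lowlat[0]'
    abs_path = os.path.abspath(os.path.join(root, path_part))
    if not os.path.isfile(abs_path):
        # Mirrors the schema validator rejecting entries whose recipe is missing.
        raise FileNotFoundError(f"recipe not found: {abs_path}")
    return abs_path, recipe
```

The key property is that the override suffix never reaches the filesystem check, while `srtctl` sees the recipe reference exactly as written in the master yaml.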
+A handful of sglang-style files that carry override sections spanning both stp and mtp are parked one level shallower (the trailing `mtp`/`stp` segment is omitted). The benchmark template resolves `recipe` to an absolute path and passes it to the launcher as `CONFIG_FILE`, so launchers do not see the relative form.
+
+### Custom-script benchmarking
+
+Recipes are migrating from srt-slurm's bundled `benchmark.type: sa-bench` to `benchmark.type: custom`, so that the benchmark client lives in this repo (`utils/bench_serving/benchmark_serving.py`) instead of being maintained twice. New shape:
+
+```yaml
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"  # per prefill worker (filename component)
+    DECODE_GPUS: "8"   # per decode worker (filename component)
+    TOTAL_GPUS: "20"   # sum across workers (filename component)
+    # MODEL_NAME: "..."          # only when server's served-model-name
+    #                            # differs from master-yaml's `model:`
+    # USE_CHAT_TEMPLATE: "false" # only when overriding default (true)
+```
+
+`MODEL`, `ISL`, `OSL`, `CONC_LIST`, `DISAGG`, and `RANDOM_RANGE_RATIO` are exported by `benchmark-multinode-tmpl.yml` at the workflow step and propagate through the launcher → `srtctl` → `srun` (default `--export=ALL`) → pyxis into the benchmark container, so they don't need to be re-declared in `benchmark.env`. The recipe only carries per-recipe topology knobs (`PREFILL_GPUS`/`DECODE_GPUS`/`TOTAL_GPUS`, used in the result filename) plus the rare overrides (`MODEL_NAME` when the server's served-model-name diverges from `model:`, `USE_CHAT_TEMPLATE: false` for tokenizers that have no chat template, etc.).
+
+`benchmarks/multi_node/srt_bench.sh` is a thin wrapper around `run_benchmark_serving()` in `benchmarks/benchmark_lib.sh` (the same shim every single-node bench script uses).
+It loops once per concurrency in `$CONC_LIST` and writes results to `/logs/sa-bench_isl_<ISL>_osl_<OSL>/results_concurrency_<concurrency>_gpus_<TOTAL_GPUS>_ctx_<PREFILL_GPUS>_gen_<DECODE_GPUS>.json` so existing launcher result-harvesters pick them up unchanged. The tokenizer is loaded from `/model` — `srtctl`'s `RuntimeContext.create` auto-mounts the model dir at that path in every container, so we don't need any HF Hub egress.
+
+The `container_mounts` block bind-mounts the host-side `$INFMAX_WORKSPACE` (set by the launcher to `$GITHUB_WORKSPACE`) at `/infmax-workspace` inside srt-slurm's benchmark container, so the wrapper and bench client are reachable at known paths. `srtctl` resolves `$INFMAX_WORKSPACE` via `os.path.expandvars` at submission time.
+
 ## Runners
 
 The `runners.yaml` config represents the available runners in the repository. The keys are the runner *types* (i.e., the GPUs as well as some specific combinations like `b200-trt`) whereas the value is a list of *runner nodes*. This config is used to verify the master configs.
diff --git a/.github/configs/nvidia-master.yaml b/.github/configs/nvidia-master.yaml index f13b8b6dd..eb8ad8678 100644 --- a/.github/configs/nvidia-master.yaml +++ b/.github/configs/nvidia-master.yaml @@ -13,14 +13,12 @@ dsr1-fp4-b200-dynamo-trt: search-space: - spec-decoding: "mtp" conc-list: [1214] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch64_eplb0_mtp2.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/mtp/ctx1_gen2_dep8_batch64_eplb0_mtp2.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/mtp/ctx1_gen2_dep8_batch64_eplb0_mtp2.yaml" decode: num-worker: 2 tp: 8 @@ -28,14 +26,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [875] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - #
"CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp3.yaml" decode: num-worker: 5 tp: 8 @@ -43,14 +39,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [6] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml" decode: num-worker: 5 tp: 8 @@ -58,14 +52,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [10, 15, 25, 45, 90, 180] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml" decode: num-worker: 5 tp: 8 @@ -73,14 +65,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [ 4968 ] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx3_gen4_dep8_batch128_eplb0_mtp1.yaml" prefill: num-worker: 3 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/mtp/ctx3_gen4_dep8_batch128_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/mtp/ctx3_gen4_dep8_batch128_eplb0_mtp1.yaml" decode: num-worker: 4 tp: 8 @@ -88,14 +78,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [10860] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx3_gen5_dep4_batch512_eplb0_mtp1.yaml" prefill: num-worker: 3 tp: 4 ep: 4 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/mtp/ctx3_gen5_dep4_batch512_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/mtp/ctx3_gen5_dep4_batch512_eplb0_mtp1.yaml" decode: num-worker: 5 tp: 4 @@ -104,84 +92,72 @@ dsr1-fp4-b200-dynamo-trt: # Non-MTP configurations - conc-list: [4096] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [2192] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen2_dep8_batch128_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen2_dep8_batch128_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen2_dep8_batch128_eplb0_mtp0.yaml" decode: num-worker: 2 tp: 8 ep: 8 dp-attn: true - conc-list: [1365] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_dep8_batch32_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen5_dep8_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen5_dep8_batch32_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: true - conc-list: [6] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: false - conc-list: [10, 15, 25, 45, 90, 180] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch32_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen5_tep8_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen5_tep8_batch32_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: false - conc-list: [450] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen6_tep8_batch64_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen6_tep8_batch64_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen6_tep8_batch64_eplb0_mtp0.yaml" decode: num-worker: 6 tp: 8 @@ -193,14 +169,12 @@ dsr1-fp4-b200-dynamo-trt: search-space: - spec-decoding: "mtp" conc-list: [90] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen1_dep8_batch8_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/mtp/ctx1_gen1_dep8_batch8_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/mtp/ctx1_gen1_dep8_batch8_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 8 @@ -208,14 +182,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [66] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen3_tep8_batch16_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/mtp/ctx1_gen3_tep8_batch16_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/mtp/ctx1_gen3_tep8_batch16_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 8 @@ -223,14 +195,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [6] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml" decode: num-worker: 5 tp: 8 @@ -238,14 +208,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [10, 15, 30, 60] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen5_tep8_batch8_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/mtp/ctx1_gen5_tep8_batch8_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/mtp/ctx1_gen5_tep8_batch8_eplb0_mtp3.yaml" decode: num-worker: 5 tp: 8 @@ -253,14 +221,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [548] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3.yaml" prefill: num-worker: 3 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 8 @@ -268,14 +234,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [1096, 1691] + recipe: 
"dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx5_gen1_dep8_batch192_eplb0_mtp1.yaml" prefill: num-worker: 5 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/mtp/ctx5_gen1_dep8_batch192_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/mtp/ctx5_gen1_dep8_batch192_eplb0_mtp1.yaml" decode: num-worker: 1 tp: 8 @@ -283,14 +247,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [658] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx5_gen2_dep8_batch32_eplb0_mtp3.yaml" prefill: num-worker: 5 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/mtp/ctx5_gen2_dep8_batch32_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/mtp/ctx5_gen2_dep8_batch32_eplb0_mtp3.yaml" decode: num-worker: 2 tp: 8 @@ -299,84 +261,72 @@ dsr1-fp4-b200-dynamo-trt: # Non-MTP configurations - conc-list: [6] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: false - conc-list: [10, 15, 25, 50, 100] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx1_gen5_tep8_batch8_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/stp/ctx1_gen5_tep8_batch8_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/stp/ctx1_gen5_tep8_batch8_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: false - conc-list: [370] + recipe: 
"dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx2_gen5_tep8_batch64_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/stp/ctx2_gen5_tep8_batch64_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/stp/ctx2_gen5_tep8_batch64_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: false - conc-list: [1606] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx4_gen1_dep8_batch192_eplb0_mtp0.yaml" prefill: num-worker: 4 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/stp/ctx4_gen1_dep8_batch192_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/stp/ctx4_gen1_dep8_batch192_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [837] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx4_gen3_dep8_batch32_eplb0_mtp0.yaml" prefill: num-worker: 4 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/stp/ctx4_gen3_dep8_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/stp/ctx4_gen3_dep8_batch32_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: true - conc-list: [2222] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx7_gen2_dep8_batch128_eplb0_mtp0.yaml" prefill: num-worker: 7 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/stp/ctx7_gen2_dep8_batch128_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/stp/ctx7_gen2_dep8_batch128_eplb0_mtp0.yaml" decode: num-worker: 2 tp: 8 @@ -399,14 +349,12 @@ dsr1-fp8-b200-dynamo-trt: # MTP configurations - Low latency (TP attention) - spec-decoding: "mtp" conc-list: [8] + recipe: 
"dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_8.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_8.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_8.yaml" decode: num-worker: 8 tp: 8 @@ -414,14 +362,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [32] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch4_eplb0_mtp3_32.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen8_tp8_batch4_eplb0_mtp3_32.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen8_tp8_batch4_eplb0_mtp3_32.yaml" decode: num-worker: 8 tp: 8 @@ -429,14 +375,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [64] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch8_eplb0_mtp3_64.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen8_tp8_batch8_eplb0_mtp3_64.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen8_tp8_batch8_eplb0_mtp3_64.yaml" decode: num-worker: 8 tp: 8 @@ -444,14 +388,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [256] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch32_eplb0_mtp3_256.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen8_tp8_batch32_eplb0_mtp3_256.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen8_tp8_batch32_eplb0_mtp3_256.yaml" decode: num-worker: 8 tp: 8 @@ -460,14 +402,12 
@@ dsr1-fp8-b200-dynamo-trt: # MTP configurations - High throughput (DP attention) - spec-decoding: "mtp" conc-list: [896] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen7_dep8_batch128_eplb0_mtp3_896.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen7_dep8_batch128_eplb0_mtp3_896.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen7_dep8_batch128_eplb0_mtp3_896.yaml" decode: num-worker: 7 tp: 8 @@ -475,14 +415,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [1024] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen4_dep8_batch256_eplb0_mtp3_1024.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen4_dep8_batch256_eplb0_mtp3_1024.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen4_dep8_batch256_eplb0_mtp3_1024.yaml" decode: num-worker: 4 tp: 8 @@ -490,14 +428,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [1184] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen3_dep8_batch384_eplb0_mtp3_1184.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen3_dep8_batch384_eplb0_mtp3_1184.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen3_dep8_batch384_eplb0_mtp3_1184.yaml" decode: num-worker: 3 tp: 8 @@ -505,14 +441,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [1600] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch768_eplb0_mtp2_1600.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen2_dep8_batch768_eplb0_mtp2_1600.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen2_dep8_batch768_eplb0_mtp2_1600.yaml" decode: num-worker: 2 tp: 8 @@ -521,42 +455,36 @@ dsr1-fp8-b200-dynamo-trt: # Non-MTP (STP) configurations - Low latency (TP attention) - conc-list: [4] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_4.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_4.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_4.yaml" decode: num-worker: 3 tp: 8 ep: 1 dp-attn: false - conc-list: [32] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_32.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_32.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_32.yaml" decode: num-worker: 3 tp: 8 ep: 1 dp-attn: false - conc-list: [128] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_128.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_128.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_128.yaml" decode: num-worker: 3 tp: 8 @@ -564,42 +492,36 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: false # Non-MTP (STP) configurations - High throughput (DP attention) - conc-list: [1920] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen5_dep8_batch48_eplb0_mtp0_1920.yaml" 
prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen5_dep8_batch48_eplb0_mtp0_1920.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen5_dep8_batch48_eplb0_mtp0_1920.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: true - conc-list: [4096] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4096.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4096.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4096.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [5152] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx2_gen5_dep8_batch128_eplb0_mtp0_5152.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/stp/ctx2_gen5_dep8_batch128_eplb0_mtp0_5152.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/stp/ctx2_gen5_dep8_batch128_eplb0_mtp0_5152.yaml" decode: num-worker: 5 tp: 8 @@ -612,14 +534,12 @@ dsr1-fp8-b200-dynamo-trt: # MTP configurations - Low latency (TP attention) - spec-decoding: "mtp" conc-list: [8] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_8.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_8.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_8.yaml" decode: num-worker: 6 tp: 8 @@ -627,14 +547,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [8] + recipe: 
"dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch32_eplb0_mtp3_8.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/mtp/ctx1_gen2_tp8_batch32_eplb0_mtp3_8.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/mtp/ctx1_gen2_tp8_batch32_eplb0_mtp3_8.yaml" decode: num-worker: 2 tp: 8 @@ -642,14 +560,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [48] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_48.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_48.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_48.yaml" decode: num-worker: 6 tp: 8 @@ -657,14 +573,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [64] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch16_eplb0_mtp3_64.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/mtp/ctx1_gen4_tp8_batch16_eplb0_mtp3_64.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/mtp/ctx1_gen4_tp8_batch16_eplb0_mtp3_64.yaml" decode: num-worker: 4 tp: 8 @@ -673,14 +587,12 @@ dsr1-fp8-b200-dynamo-trt: # MTP configurations - High throughput (DP attention) - spec-decoding: "mtp" conc-list: [224] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx2_gen3_dep8_batch8_eplb0_mtp3_224.yaml" prefill: num-worker: 2 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/mtp/ctx2_gen3_dep8_batch8_eplb0_mtp3_224.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/mtp/ctx2_gen3_dep8_batch8_eplb0_mtp3_224.yaml" 
decode: num-worker: 3 tp: 8 @@ -688,14 +600,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [288] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx2_gen1_dep8_batch32_eplb0_mtp3_288.yaml" prefill: num-worker: 2 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/mtp/ctx2_gen1_dep8_batch32_eplb0_mtp3_288.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/mtp/ctx2_gen1_dep8_batch32_eplb0_mtp3_288.yaml" decode: num-worker: 1 tp: 8 @@ -703,14 +613,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [1088] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx4_gen1_dep8_batch128_eplb0_mtp2_1088.yaml" prefill: num-worker: 4 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/mtp/ctx4_gen1_dep8_batch128_eplb0_mtp2_1088.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/mtp/ctx4_gen1_dep8_batch128_eplb0_mtp2_1088.yaml" decode: num-worker: 1 tp: 8 @@ -719,56 +627,48 @@ dsr1-fp8-b200-dynamo-trt: # Non-MTP (STP) configurations - Low latency (TP attention) - conc-list: [1] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_tp8_batch1_eplb0_mtp0_1.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen1_tp8_batch1_eplb0_mtp0_1.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen1_tp8_batch1_eplb0_mtp0_1.yaml" decode: num-worker: 1 tp: 8 ep: 1 dp-attn: false - conc-list: [32] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_32.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_32.yaml - - 
"CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_32.yaml" decode: num-worker: 4 tp: 8 ep: 1 dp-attn: false - conc-list: [128] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_128.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_128.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_128.yaml" decode: num-worker: 4 tp: 8 ep: 1 dp-attn: false - conc-list: [96] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen6_tp8_batch16_eplb0_mtp0_96.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen6_tp8_batch16_eplb0_mtp0_96.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen6_tp8_batch16_eplb0_mtp0_96.yaml" decode: num-worker: 6 tp: 8 @@ -776,56 +676,48 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: false # Non-MTP (STP) configurations - High throughput (DP attention) - conc-list: [128] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_dep8_batch128_eplb0_mtp0_128.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen1_dep8_batch128_eplb0_mtp0_128.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen1_dep8_batch128_eplb0_mtp0_128.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [128] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0_128.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0_128.yaml - - 
"CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0_128.yaml" decode: num-worker: 2 tp: 8 ep: 8 dp-attn: true - conc-list: [256] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_256.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_256.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_256.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [640] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx2_gen1_dep8_batch640_eplb0_mtp0_640.yaml" prefill: num-worker: 2 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/stp/ctx2_gen1_dep8_batch640_eplb0_mtp0_640.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/stp/ctx2_gen1_dep8_batch640_eplb0_mtp0_640.yaml" decode: num-worker: 1 tp: 8 @@ -848,14 +740,12 @@ dsr1-fp4-b300-dynamo-trt: search-space: - spec-decoding: "mtp" conc-list: [654] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep8_batch64_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/mtp/ctx1_gen1_dep8_batch64_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/mtp/ctx1_gen1_dep8_batch64_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 8 @@ -863,14 +753,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [271] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch16_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/mtp/ctx1_gen2_dep8_batch16_eplb0_mtp3.yaml - - 
"CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/mtp/ctx1_gen2_dep8_batch16_eplb0_mtp3.yaml" decode: num-worker: 2 tp: 8 @@ -878,14 +766,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [11] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml" decode: num-worker: 5 tp: 8 @@ -893,14 +779,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [10, 20, 25, 60, 120, 200] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml" decode: num-worker: 5 tp: 8 @@ -908,14 +792,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [2342] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx2_gen1_dep8_batch256_eplb0_mtp1.yaml" prefill: num-worker: 2 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/mtp/ctx2_gen1_dep8_batch256_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/mtp/ctx2_gen1_dep8_batch256_eplb0_mtp1.yaml" decode: num-worker: 1 tp: 8 @@ -923,14 +805,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [8609] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx5_gen2_dep8_batch512_eplb0_mtp1.yaml" prefill: num-worker: 5 tp: 2 ep: 2 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/mtp/ctx5_gen2_dep8_batch512_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/mtp/ctx5_gen2_dep8_batch512_eplb0_mtp1.yaml" decode: num-worker: 2 tp: 8 @@ -938,14 +818,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [12926] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx5_gen2_dep8_batch768_eplb0_mtp1.yaml" prefill: num-worker: 5 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/mtp/ctx5_gen2_dep8_batch768_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/mtp/ctx5_gen2_dep8_batch768_eplb0_mtp1.yaml" decode: num-worker: 2 tp: 8 @@ -954,98 +832,84 @@ dsr1-fp4-b300-dynamo-trt: # Non-MTP configurations - conc-list: [1176] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0.yaml" decode: num-worker: 2 tp: 8 ep: 8 dp-attn: true - conc-list: [6] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 8 ep: 8 dp-attn: false - conc-list: [5, 10, 15, 25] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 4 ep: 4 dp-attn: false - conc-list: [60, 110, 195, 395] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch64_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/stp/ctx1_gen5_tep8_batch64_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/stp/ctx1_gen5_tep8_batch64_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: false - conc-list: [4405] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx2_gen1_dep8_batch512_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/stp/ctx2_gen1_dep8_batch512_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/stp/ctx2_gen1_dep8_batch512_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [8192] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx3_gen1_dep8_batch1024_eplb0_mtp0.yaml" prefill: num-worker: 3 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/stp/ctx3_gen1_dep8_batch1024_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/stp/ctx3_gen1_dep8_batch1024_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [4611] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx3_gen2_dep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 3 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/stp/ctx3_gen2_dep8_batch256_eplb0_mtp0.yaml - - 
"CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/stp/ctx3_gen2_dep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 2 tp: 8 @@ -1057,14 +921,12 @@ dsr1-fp4-b300-dynamo-trt: search-space: - spec-decoding: "mtp" conc-list: [2198] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml" prefill: num-worker: 10 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml" decode: num-worker: 1 tp: 8 @@ -1072,14 +934,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [52] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep4_batch8_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/mtp/ctx1_gen4_tep4_batch8_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/mtp/ctx1_gen4_tep4_batch8_eplb0_mtp3.yaml" decode: num-worker: 4 tp: 4 @@ -1087,14 +947,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [8] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml" decode: num-worker: 4 tp: 8 @@ -1102,14 +960,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [32] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml" decode: num-worker: 4 tp: 8 @@ -1117,14 +973,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [181] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx3_gen1_dep8_batch16_eplb0_mtp3.yaml" prefill: num-worker: 3 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/mtp/ctx3_gen1_dep8_batch16_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/mtp/ctx3_gen1_dep8_batch16_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 8 @@ -1132,14 +986,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [1197] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx9_gen1_dep8_batch128_eplb0_mtp1.yaml" prefill: num-worker: 9 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/mtp/ctx9_gen1_dep8_batch128_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/mtp/ctx9_gen1_dep8_batch128_eplb0_mtp1.yaml" decode: num-worker: 1 tp: 8 @@ -1148,98 +1000,84 @@ dsr1-fp4-b300-dynamo-trt: # Non-MTP configurations - conc-list: [105] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep4_batch32_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/stp/ctx1_gen3_tep4_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/stp/ctx1_gen3_tep4_batch32_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 4 ep: 4 dp-attn: false - conc-list: [63] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - 
additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: false - conc-list: [4] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: false - conc-list: [12] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen4_tep4_batch2_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/stp/ctx1_gen4_tep4_batch2_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/stp/ctx1_gen4_tep4_batch2_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 4 ep: 4 dp-attn: false - conc-list: [589] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx5_gen2_dep8_batch32_eplb0_mtp0.yaml" prefill: num-worker: 5 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/stp/ctx5_gen2_dep8_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/stp/ctx5_gen2_dep8_batch32_eplb0_mtp0.yaml" decode: num-worker: 2 tp: 8 ep: 8 dp-attn: true - conc-list: [1093] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx6_gen1_dep8_batch128_eplb0_mtp0.yaml" prefill: num-worker: 6 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/stp/ctx6_gen1_dep8_batch128_eplb0_mtp0.yaml - - 
"CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/stp/ctx6_gen1_dep8_batch128_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [2048] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx8_gen1_dep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 8 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/stp/ctx8_gen1_dep8_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/stp/ctx8_gen1_dep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 @@ -1262,14 +1100,12 @@ dsr1-fp8-b300-dynamo-trt: search-space: - spec-decoding: "mtp" conc-list: [10] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_10.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_10.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_10.yaml" decode: num-worker: 8 tp: 8 @@ -1277,14 +1113,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [160] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch16_eplb0_mtp3_160.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen8_tp8_batch16_eplb0_mtp3_160.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen8_tp8_batch16_eplb0_mtp3_160.yaml" decode: num-worker: 8 tp: 8 @@ -1292,14 +1126,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [3072] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml" decode: num-worker: 1 tp: 8 @@ -1307,14 +1139,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [2560] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch128_eplb0_mtp1_2560.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen2_dep8_batch128_eplb0_mtp1_2560.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen2_dep8_batch128_eplb0_mtp1_2560.yaml" decode: num-worker: 2 tp: 8 @@ -1322,14 +1152,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [720] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp2_720.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp2_720.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp2_720.yaml" decode: num-worker: 5 tp: 8 @@ -1337,14 +1165,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [11264] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx3_gen2_dp8_batch512_eplb0_mtp1_11264.yaml" prefill: num-worker: 3 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/mtp/ctx3_gen2_dp8_batch512_eplb0_mtp1_11264.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/mtp/ctx3_gen2_dp8_batch512_eplb0_mtp1_11264.yaml" decode: num-worker: 2 tp: 8 @@ -1355,98 +1181,84 @@ dsr1-fp8-b300-dynamo-trt: osl: 1024 search-space: - conc-list: [2112] + recipe: 
"dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_2112.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_2112.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_2112.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [3072] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen2_dp8_batch128_eplb0_mtp0_3072.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen2_dp8_batch128_eplb0_mtp0_3072.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen2_dp8_batch128_eplb0_mtp0_3072.yaml" decode: num-worker: 2 tp: 8 ep: 1 dp-attn: true - conc-list: [1280] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen3_dp8_batch48_eplb0_mtp0_1280.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen3_dp8_batch48_eplb0_mtp0_1280.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen3_dp8_batch48_eplb0_mtp0_1280.yaml" decode: num-worker: 3 tp: 8 ep: 1 dp-attn: true - conc-list: [12] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_12.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_12.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_12.yaml" decode: num-worker: 8 tp: 8 ep: 1 dp-attn: false - conc-list: [128] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_128.yaml" prefill: num-worker: 1 tp: 4 ep: 1 
dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_128.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_128.yaml" decode: num-worker: 8 tp: 8 ep: 1 dp-attn: false - conc-list: [384] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_384.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_384.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_384.yaml" decode: num-worker: 8 tp: 8 ep: 1 dp-attn: false - conc-list: [16384] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx2_gen1_dp8_batch1024_eplb0_mtp0_16384.yaml" prefill: num-worker: 2 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/stp/ctx2_gen1_dp8_batch1024_eplb0_mtp0_16384.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/stp/ctx2_gen1_dp8_batch1024_eplb0_mtp0_16384.yaml" decode: num-worker: 1 tp: 8 @@ -1458,14 +1270,12 @@ dsr1-fp8-b300-dynamo-trt: search-space: - spec-decoding: "mtp" conc-list: [40] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch16_eplb0_mtp3_40.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/mtp/ctx1_gen2_tp8_batch16_eplb0_mtp3_40.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/mtp/ctx1_gen2_tp8_batch16_eplb0_mtp3_40.yaml" decode: num-worker: 2 tp: 8 @@ -1473,14 +1283,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [8] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch1_eplb0_mtp3_8.yaml" prefill: num-worker: 1 tp: 4 ep: 1 
dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/mtp/ctx1_gen4_tp8_batch1_eplb0_mtp3_8.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/mtp/ctx1_gen4_tp8_batch1_eplb0_mtp3_8.yaml" decode: num-worker: 4 tp: 8 @@ -1488,14 +1296,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [20] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch4_eplb0_mtp3_20.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/mtp/ctx1_gen4_tp8_batch4_eplb0_mtp3_20.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/mtp/ctx1_gen4_tp8_batch4_eplb0_mtp3_20.yaml" decode: num-worker: 4 tp: 8 @@ -1503,14 +1309,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [72] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen1_dp8_batch8_eplb0_mtp3_72.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/mtp/ctx1_gen1_dp8_batch8_eplb0_mtp3_72.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/mtp/ctx1_gen1_dp8_batch8_eplb0_mtp3_72.yaml" decode: num-worker: 1 tp: 8 @@ -1518,14 +1322,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [144] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx2_gen1_dp8_batch16_eplb0_mtp3_144.yaml" prefill: num-worker: 2 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/mtp/ctx2_gen1_dp8_batch16_eplb0_mtp3_144.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/mtp/ctx2_gen1_dp8_batch16_eplb0_mtp3_144.yaml" decode: num-worker: 1 tp: 8 @@ -1533,14 +1335,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [512] + recipe: 
"dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx4_gen1_dp8_batch64_eplb0_mtp2_512.yaml" prefill: num-worker: 4 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/mtp/ctx4_gen1_dp8_batch64_eplb0_mtp2_512.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/mtp/ctx4_gen1_dp8_batch64_eplb0_mtp2_512.yaml" decode: num-worker: 1 tp: 8 @@ -1551,98 +1351,84 @@ dsr1-fp8-b300-dynamo-trt: osl: 1024 search-space: - conc-list: [64] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch16_eplb0_mtp0_64.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/stp/ctx1_gen4_tp8_batch16_eplb0_mtp0_64.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/stp/ctx1_gen4_tp8_batch16_eplb0_mtp0_64.yaml" decode: num-worker: 4 tp: 8 ep: 1 dp-attn: false - conc-list: [16] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx1_gen8_tp8_batch2_eplb0_mtp0_16.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/stp/ctx1_gen8_tp8_batch2_eplb0_mtp0_16.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/stp/ctx1_gen8_tp8_batch2_eplb0_mtp0_16.yaml" decode: num-worker: 8 tp: 8 ep: 1 dp-attn: false - conc-list: [256] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx2_gen1_dp8_batch32_eplb0_mtp0_256.yaml" prefill: num-worker: 2 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/stp/ctx2_gen1_dp8_batch32_eplb0_mtp0_256.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/stp/ctx2_gen1_dp8_batch32_eplb0_mtp0_256.yaml" decode: num-worker: 1 tp: 8 ep: 1 dp-attn: true - conc-list: [512] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx3_gen1_dp8_batch64_eplb0_mtp0_512.yaml" prefill: 
num-worker: 3 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/stp/ctx3_gen1_dp8_batch64_eplb0_mtp0_512.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/stp/ctx3_gen1_dp8_batch64_eplb0_mtp0_512.yaml" decode: num-worker: 1 tp: 8 ep: 1 dp-attn: true - conc-list: [256] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx3_gen5_tp8_batch64_eplb0_mtp0_256.yaml" prefill: num-worker: 3 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/stp/ctx3_gen5_tp8_batch64_eplb0_mtp0_256.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/stp/ctx3_gen5_tp8_batch64_eplb0_mtp0_256.yaml" decode: num-worker: 5 tp: 8 ep: 1 dp-attn: false - conc-list: [1075] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx5_gen1_dp8_batch128_eplb0_mtp0_1075.yaml" prefill: num-worker: 5 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/stp/ctx5_gen1_dp8_batch128_eplb0_mtp0_1075.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/stp/ctx5_gen1_dp8_batch128_eplb0_mtp0_1075.yaml" decode: num-worker: 1 tp: 8 ep: 1 dp-attn: true - conc-list: [3072] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx7_gen1_dep8_batch384_eplb0_mtp0_3072.yaml" prefill: num-worker: 7 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/stp/ctx7_gen1_dep8_batch384_eplb0_mtp0_3072.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/stp/ctx7_gen1_dep8_batch384_eplb0_mtp0_3072.yaml" decode: num-worker: 1 tp: 8 @@ -2676,14 +2462,12 @@ dsr1-fp8-h200-dynamo-trt: # MTP configurations - spec-decoding: "mtp" conc-list: [1] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c1_ctx1_gen11_tep8_batch1_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - 
additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c1_ctx1_gen11_tep8_batch1_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c1_ctx1_gen11_tep8_batch1_eplb0_mtp3.yaml" decode: num-worker: 11 tp: 8 @@ -2691,14 +2475,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [4] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c4_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c4_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c4_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml" decode: num-worker: 11 tp: 8 @@ -2706,14 +2488,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [8] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c8_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c8_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c8_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml" decode: num-worker: 11 tp: 8 @@ -2721,14 +2501,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [16] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c16_ctx1_gen9_tep8_batch128_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c16_ctx1_gen9_tep8_batch128_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c16_ctx1_gen9_tep8_batch128_eplb0_mtp3.yaml" decode: num-worker: 9 tp: 8 @@ -2736,14 +2514,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [32] + recipe: 
"dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c32_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c32_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c32_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml" decode: num-worker: 11 tp: 8 @@ -2751,14 +2527,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [64] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c64_ctx1_gen8_dep8_batch128_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c64_ctx1_gen8_dep8_batch128_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c64_ctx1_gen8_dep8_batch128_eplb0_mtp3.yaml" decode: num-worker: 8 tp: 8 @@ -2766,14 +2540,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [128] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c128_ctx1_gen7_dep8_batch128_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c128_ctx1_gen7_dep8_batch128_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c128_ctx1_gen7_dep8_batch128_eplb0_mtp3.yaml" decode: num-worker: 7 tp: 8 @@ -2781,14 +2553,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [256] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c256_ctx1_gen4_dep8_batch128_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c256_ctx1_gen4_dep8_batch128_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c256_ctx1_gen4_dep8_batch128_eplb0_mtp3.yaml" decode: num-worker: 
4 tp: 8 @@ -2796,14 +2566,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [512] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c512_ctx1_gen2_dep8_batch256_eplb0_mtp1.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c512_ctx1_gen2_dep8_batch256_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c512_ctx1_gen2_dep8_batch256_eplb0_mtp1.yaml" decode: num-worker: 2 tp: 8 @@ -2811,126 +2579,108 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: true # Non-MTP configurations (STP) - conc-list: [1] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c1_ctx1_gen9_tep8_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c1_ctx1_gen9_tep8_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c1_ctx1_gen9_tep8_batch1_eplb0_mtp0.yaml" decode: num-worker: 9 tp: 8 ep: 8 dp-attn: false - conc-list: [4] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c4_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c4_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c4_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 9 tp: 8 ep: 8 dp-attn: false - conc-list: [8] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c8_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c8_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c8_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 9 tp: 8 ep: 8 
dp-attn: false - conc-list: [16] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c16_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c16_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c16_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 9 tp: 8 ep: 8 dp-attn: false - conc-list: [32] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c32_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c32_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c32_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 9 tp: 8 ep: 8 dp-attn: false - conc-list: [64] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c64_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c64_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c64_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 9 tp: 8 ep: 8 dp-attn: false - conc-list: [128] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c128_ctx1_gen9_dep8_batch512_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c128_ctx1_gen9_dep8_batch512_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c128_ctx1_gen9_dep8_batch512_eplb0_mtp0.yaml" decode: num-worker: 9 tp: 8 ep: 8 dp-attn: true - conc-list: [256] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c256_ctx1_gen6_dep8_batch512_eplb0_mtp0.yaml" prefill: 
num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c256_ctx1_gen6_dep8_batch512_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c256_ctx1_gen6_dep8_batch512_eplb0_mtp0.yaml" decode: num-worker: 6 tp: 8 ep: 8 dp-attn: true - conc-list: [512] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c512_ctx2_gen7_dep8_batch512_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c512_ctx2_gen7_dep8_batch512_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c512_ctx2_gen7_dep8_batch512_eplb0_mtp0.yaml" decode: num-worker: 7 tp: 8 @@ -2942,14 +2692,12 @@ dsr1-fp8-h200-dynamo-trt: # MTP configurations - spec-decoding: "mtp" conc-list: [1] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp3.yaml" decode: num-worker: 7 tp: 8 @@ -2957,14 +2705,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [4] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp3.yaml" decode: num-worker: 7 tp: 8 @@ -2972,14 +2718,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [8] + recipe: 
"dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c8_ctx1_gen6_tep8_batch32_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c8_ctx1_gen6_tep8_batch32_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c8_ctx1_gen6_tep8_batch32_eplb0_mtp3.yaml" decode: num-worker: 6 tp: 8 @@ -2987,14 +2731,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [16] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp2.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp2.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp2.yaml" decode: num-worker: 3 tp: 8 @@ -3002,14 +2744,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [32] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c32_ctx3_gen5_tep8_batch32_eplb0_mtp3.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c32_ctx3_gen5_tep8_batch32_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c32_ctx3_gen5_tep8_batch32_eplb0_mtp3.yaml" decode: num-worker: 5 tp: 8 @@ -3017,14 +2757,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [64] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c64_ctx1_gen1_dep8_batch32_eplb0_mtp2.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c64_ctx1_gen1_dep8_batch32_eplb0_mtp2.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c64_ctx1_gen1_dep8_batch32_eplb0_mtp2.yaml" decode: num-worker: 1 tp: 8 @@ -3032,14 
+2770,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [128] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c128_ctx2_gen1_dep8_batch32_eplb0_mtp2.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c128_ctx2_gen1_dep8_batch32_eplb0_mtp2.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c128_ctx2_gen1_dep8_batch32_eplb0_mtp2.yaml" decode: num-worker: 1 tp: 8 @@ -3047,14 +2783,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [256] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c256_ctx3_gen1_dep8_batch32_eplb0_mtp2.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c256_ctx3_gen1_dep8_batch32_eplb0_mtp2.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c256_ctx3_gen1_dep8_batch32_eplb0_mtp2.yaml" decode: num-worker: 1 tp: 8 @@ -3062,14 +2796,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [512] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c512_ctx3_gen1_dep8_batch64_eplb0_mtp1.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c512_ctx3_gen1_dep8_batch64_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c512_ctx3_gen1_dep8_batch64_eplb0_mtp1.yaml" decode: num-worker: 1 tp: 8 @@ -3077,126 +2809,108 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: true # Non-MTP configurations (STP) - conc-list: [1] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp0.yaml - - 
"CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp0.yaml" decode: num-worker: 7 tp: 8 ep: 8 dp-attn: false - conc-list: [4] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp0.yaml" decode: num-worker: 7 tp: 8 ep: 8 dp-attn: false - conc-list: [8] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c8_ctx1_gen6_tep8_batch16_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c8_ctx1_gen6_tep8_batch16_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c8_ctx1_gen6_tep8_batch16_eplb0_mtp0.yaml" decode: num-worker: 6 tp: 8 ep: 8 dp-attn: false - conc-list: [16] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: false - conc-list: [32] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c32_ctx2_gen5_tep8_batch128_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c32_ctx2_gen5_tep8_batch128_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c32_ctx2_gen5_tep8_batch128_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: false - conc-list: [64] + recipe: 
"dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c64_ctx2_gen3_dep8_batch128_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c64_ctx2_gen3_dep8_batch128_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c64_ctx2_gen3_dep8_batch128_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: true - conc-list: [128] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c128_ctx1_gen1_dep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c128_ctx1_gen1_dep8_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c128_ctx1_gen1_dep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [256] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c256_ctx5_gen3_dep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 5 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c256_ctx5_gen3_dep8_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c256_ctx5_gen3_dep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: true - conc-list: [512] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c512_ctx3_gen1_dep8_batch512_eplb0_mtp0.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c512_ctx3_gen1_dep8_batch512_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c512_ctx3_gen1_dep8_batch512_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 @@ -3219,14 +2933,12 @@ dsr1-fp8-h100-dynamo-trt: # MTP configurations - spec-decoding: "mtp" conc-list: [6] + recipe: 
"dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3234,14 +2946,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [9] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3249,14 +2959,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [30] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3264,14 +2972,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [60] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch16_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch16_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch16_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3279,14 
+2985,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [117] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch32_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch32_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch32_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3294,14 +2998,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [231] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_dep16_batch4_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_dep16_batch4_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_dep16_batch4_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3309,14 +3011,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [462] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch128_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch128_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch128_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3324,14 +3024,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [615] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp2.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp2.yaml - - 
"CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp2.yaml" decode: num-worker: 1 tp: 16 @@ -3339,14 +3037,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [1229] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1.yaml" decode: num-worker: 1 tp: 16 @@ -3354,126 +3050,108 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: true # Non-MTP configurations (STP) - conc-list: [6] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: false - conc-list: [9] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: false - conc-list: [30] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml - - 
"CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: false - conc-list: [60] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch16_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_tep16_batch16_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_tep16_batch16_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: false - conc-list: [231] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch4_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_dep16_batch4_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_dep16_batch4_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: true - conc-list: [462] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch8_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_dep16_batch8_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_dep16_batch8_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: true - conc-list: [924] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch16_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_dep16_batch16_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_dep16_batch16_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: true - conc-list: 
[1845] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch32_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_dep16_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_dep16_batch32_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: true - conc-list: [4916] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx2_gen1_dep16_batch256_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx2_gen1_dep16_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx2_gen1_dep16_batch256_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 16 @@ -3485,14 +3163,12 @@ dsr1-fp8-h100-dynamo-trt: # MTP configurations (6 points) - spec-decoding: "mtp" conc-list: [6] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3500,14 +3176,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [9] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3515,14 +3189,12 @@ 
dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [30] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3530,14 +3202,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [77] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen1_dep16_batch4_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen1_dep16_batch4_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen1_dep16_batch4_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 16 @@ -3547,14 +3217,12 @@ dsr1-fp8-h100-dynamo-trt: # https://github.com/InferenceMAX/InferenceMAX/actions/runs/21769314582/job/62813105509 # - spec-decoding: "mtp" # conc-list: [78] + # recipe: "trtllm/h100-fp8/8k1k/mtp/ctx1_gen2_tep16_batch32_eplb0_mtp3.yaml" # prefill: # num-worker: 1 # tp: 16 # ep: 16 # dp-attn: true - # additional-settings: - # # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen2_tep16_batch32_eplb0_mtp3.yaml - # - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen2_tep16_batch32_eplb0_mtp3.yaml" # decode: # num-worker: 2 # tp: 16 @@ -3562,14 +3230,12 @@ dsr1-fp8-h100-dynamo-trt: # dp-attn: false - spec-decoding: "mtp" conc-list: [154] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx2_gen1_dep16_batch8_eplb0_mtp3.yaml" prefill: num-worker: 2 tp: 16 ep: 16 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/mtp/ctx2_gen1_dep16_batch8_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/mtp/ctx2_gen1_dep16_batch8_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 16 @@ -3577,70 +3243,60 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: true # STP configurations (5 points) - conc-list: [6] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: false - conc-list: [9] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: false - conc-list: [30] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: false - conc-list: [154] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen2_tep16_batch64_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/stp/ctx1_gen2_tep16_batch64_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/stp/ctx1_gen2_tep16_batch64_eplb0_mtp0.yaml" decode: num-worker: 2 tp: 16 ep: 16 dp-attn: false - conc-list: [308] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx2_gen1_dep16_batch16_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/stp/ctx2_gen1_dep16_batch16_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/stp/ctx2_gen1_dep16_batch16_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 16 @@ -3860,13 +3516,12 @@ dsr1-fp8-h100-dynamo-sglang: search-space: # # STP: Max throughput TEP (1 prefill, 2 decode) # - conc-list: [1, 2, 4, 8, 16, 32, 64, 128] + # recipe: "h100/1k1k/stp/h100-fp8-1p2d-max-tp.yaml" # prefill: # num-worker: 1 # tp: 16 # ep: 1 # dp-attn: false - # additional-settings: - # - "CONFIG_FILE=recipes/h100/1k1k/stp/h100-fp8-1p2d-max-tp.yaml" # decode: # num-worker: 2 # tp: 16 @@ -3874,13 +3529,12 @@ dsr1-fp8-h100-dynamo-sglang: # dp-attn: false # # STP: Max throughput DEP (1 prefill, 1 decode, dp-attention) # - conc-list: [1, 2, 4, 8, 16, 32, 64] + # recipe: "h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml" # prefill: # num-worker: 1 # tp: 16 # ep: 1 # dp-attn: false - # additional-settings: - # - "CONFIG_FILE=recipes/h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml" # decode: # num-worker: 1 # tp: 16 @@ -3889,13 +3543,12 @@ dsr1-fp8-h100-dynamo-sglang: # MTP: Max throughput TEP (1 prefill, 2 decode) - spec-decoding: "mtp" conc-list: [1, 2, 4, 8, 16, 32, 64, 128] + recipe: "dsr1/sglang/h100-fp8/1k1k/disagg/mtp/h100-fp8-1p2d-max-tp-mtp.yaml" prefill: num-worker: 1 tp: 16 ep: 1 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/h100/1k1k/mtp/h100-fp8-1p2d-max-tp-mtp.yaml" decode: num-worker: 2 tp: 16 @@ -3904,13 +3557,12 @@ 
dsr1-fp8-h100-dynamo-sglang: # MTP: Max throughput DEP (1 prefill, 1 decode, dp-attention) - spec-decoding: "mtp" conc-list: [1, 2, 4, 8, 16, 32, 64] + recipe: "dsr1/sglang/h100-fp8/1k1k/disagg/mtp/h100-fp8-1p1d-max-dep-mtp.yaml" prefill: num-worker: 1 tp: 16 ep: 1 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/h100/1k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml" decode: num-worker: 1 tp: 16 @@ -3921,13 +3573,12 @@ dsr1-fp8-h100-dynamo-sglang: search-space: # # STP: Max throughput TEP (1 prefill, 1 decode) # - conc-list: [1, 2, 4, 8, 16, 32, 64, 128] + # recipe: "h100/8k1k/stp/h100-fp8-1p1d-max-tp.yaml" # prefill: # num-worker: 1 # tp: 16 # ep: 1 # dp-attn: false - # additional-settings: - # - "CONFIG_FILE=recipes/h100/8k1k/stp/h100-fp8-1p1d-max-tp.yaml" # decode: # num-worker: 1 # tp: 16 @@ -3935,13 +3586,12 @@ dsr1-fp8-h100-dynamo-sglang: # dp-attn: false # # STP: Max throughput DEP (1 prefill, 1 decode, dp-attention) # - conc-list: [1, 2, 4, 8, 16, 32, 64] + # recipe: "h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml" # prefill: # num-worker: 1 # tp: 16 # ep: 1 # dp-attn: false - # additional-settings: - # - "CONFIG_FILE=recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml" # decode: # num-worker: 1 # tp: 16 @@ -3950,13 +3600,12 @@ dsr1-fp8-h100-dynamo-sglang: # MTP: Max throughput TEP (1 prefill, 1 decode) - spec-decoding: "mtp" conc-list: [1, 2, 4, 8, 16, 32, 64, 128] + recipe: "dsr1/sglang/h100-fp8/8k1k/disagg/mtp/h100-fp8-1p1d-max-tp-mtp.yaml" prefill: num-worker: 1 tp: 16 ep: 1 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-tp-mtp.yaml" decode: num-worker: 1 tp: 16 @@ -3965,13 +3614,12 @@ dsr1-fp8-h100-dynamo-sglang: # MTP: Max throughput DEP (1 prefill, 1 decode, dp-attention) - spec-decoding: "mtp" conc-list: [1, 2, 4, 8, 16, 32, 64] + recipe: "dsr1/sglang/h100-fp8/8k1k/disagg/mtp/h100-fp8-1p1d-max-dep-mtp.yaml" prefill: num-worker: 1 tp: 16 ep: 1 dp-attn: false - additional-settings: - - 
"CONFIG_FILE=recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml" decode: num-worker: 1 tp: 16 @@ -4061,14 +3709,12 @@ dsr1-fp4-gb200-dynamo-trt: # MTP configurations (spec_decoding="mtp") - spec-decoding: "mtp" conc-list: [ 180 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 32 @@ -4076,14 +3722,12 @@ dsr1-fp4-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [ 4, 8, 12, 24, 48 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml" decode: num-worker: 4 tp: 8 @@ -4091,14 +3735,12 @@ dsr1-fp4-gb200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [ 4301 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx2_gen1_dep16_batch256_eplb256_mtp1.yaml" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/mtp/ctx2_gen1_dep16_batch256_eplb256_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/mtp/ctx2_gen1_dep16_batch256_eplb256_mtp1.yaml" decode: num-worker: 1 tp: 16 @@ -4106,14 +3748,12 @@ dsr1-fp4-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [ 2253 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx3_gen1_dep32_batch64_eplb288_mtp1.yaml" prefill: num-worker: 3 tp: 4 ep: 4 dp-attn: true - 
additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/mtp/ctx3_gen1_dep32_batch64_eplb288_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/mtp/ctx3_gen1_dep32_batch64_eplb288_mtp1.yaml" decode: num-worker: 1 tp: 32 @@ -4121,14 +3761,12 @@ dsr1-fp4-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [ 16130 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx3_gen5_dep4_batch768_eplb0_mtp1.yaml" prefill: num-worker: 3 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/mtp/ctx3_gen5_dep4_batch768_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/mtp/ctx3_gen5_dep4_batch768_eplb0_mtp1.yaml" decode: num-worker: 5 tp: 4 @@ -4138,98 +3776,84 @@ dsr1-fp4-gb200-dynamo-trt: # Non-MTP configurations (default spec_decoding="none") - conc-list: [ 4301 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [ 666 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen1_dep32_batch16_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen1_dep32_batch16_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen1_dep32_batch16_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 32 ep: 32 dp-attn: true - conc-list: [ 6144 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen2_dep4_batch768_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 
ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen2_dep4_batch768_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen2_dep4_batch768_eplb0_mtp0.yaml" decode: num-worker: 2 tp: 4 ep: 4 dp-attn: true - conc-list: [ 12, 24, 48, 96, 192 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 8 ep: 8 dp-attn: false - conc-list: [ 5 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 8 ep: 8 dp-attn: false - conc-list: [ 4301 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2_gen1_dep16_batch256_eplb256_mtp0.yaml" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/stp/ctx2_gen1_dep16_batch256_eplb256_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/stp/ctx2_gen1_dep16_batch256_eplb256_mtp0.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [ 2253 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2_gen1_dep32_batch64_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/stp/ctx2_gen1_dep32_batch64_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/stp/ctx2_gen1_dep32_batch64_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 32 @@ -4242,14 +3866,12 @@ dsr1-fp4-gb200-dynamo-trt: # MTP configurations (spec_decoding="mtp") - spec-decoding: "mtp" conc-list: [ 4, 8, 12, 24, 48 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml" decode: num-worker: 4 tp: 8 @@ -4257,14 +3879,12 @@ dsr1-fp4-gb200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [ 180 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx3_gen1_dep32_batch4_eplb0_mtp3.yaml" prefill: num-worker: 3 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/mtp/ctx3_gen1_dep32_batch4_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/mtp/ctx3_gen1_dep32_batch4_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 32 @@ -4272,14 +3892,12 @@ dsr1-fp4-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [ 1229 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx7_gen1_dep16_batch64_eplb256_mtp1.yaml" prefill: num-worker: 7 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/mtp/ctx7_gen1_dep16_batch64_eplb256_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/mtp/ctx7_gen1_dep16_batch64_eplb256_mtp1.yaml" decode: num-worker: 1 tp: 16 @@ -4287,14 +3905,12 @@ dsr1-fp4-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [ 666 ] + 
recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx8_gen1_dep32_batch16_eplb0_mtp3.yaml" prefill: num-worker: 8 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/mtp/ctx8_gen1_dep32_batch16_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/mtp/ctx8_gen1_dep32_batch16_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 32 @@ -4302,14 +3918,12 @@ dsr1-fp4-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [ 4301 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx11_gen1_dep16_batch256_eplb256_mtp1.yaml" prefill: num-worker: 11 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/mtp/ctx11_gen1_dep16_batch256_eplb256_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/mtp/ctx11_gen1_dep16_batch256_eplb256_mtp1.yaml" decode: num-worker: 1 tp: 16 @@ -4318,84 +3932,72 @@ dsr1-fp4-gb200-dynamo-trt: # Non-MTP configurations (default spec_decoding="none") - conc-list: [ 12, 44, 76 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 8 ep: 8 dp-attn: false - conc-list: [ 5 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 8 ep: 8 dp-attn: false 
- conc-list: [ 333 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 32 ep: 32 dp-attn: true - conc-list: [ 1229 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx7_gen1_dep32_batch32_eplb0_mtp0.yaml" prefill: num-worker: 7 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/stp/ctx7_gen1_dep32_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/stp/ctx7_gen1_dep32_batch32_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 32 ep: 32 dp-attn: true - conc-list: [ 2253 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx8_gen1_dep16_batch128_eplb0_mtp0.yaml" prefill: num-worker: 8 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/stp/ctx8_gen1_dep16_batch128_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/stp/ctx8_gen1_dep16_batch128_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [ 4096 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx10_gen1_dep16_batch256_eplb256_mtp0.yaml" prefill: num-worker: 10 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/stp/ctx10_gen1_dep16_batch256_eplb256_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/stp/ctx10_gen1_dep16_batch256_eplb256_mtp0.yaml" decode: num-worker: 1 tp: 16 @@ -4419,14 +4021,12 @@ dsr1-fp8-gb200-dynamo-trt: search-space: - spec-decoding: "mtp" conc-list: [4301] + recipe: 
"dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep8_batch512_eplb0_mtp1_4301.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen1_dep8_batch512_eplb0_mtp1_4301.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen1_dep8_batch512_eplb0_mtp1_4301.yaml" decode: num-worker: 1 tp: 8 @@ -4434,14 +4034,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [2151] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep8_batch256_eplb0_mtp1_2151.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen1_dep8_batch256_eplb0_mtp1_2151.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen1_dep8_batch256_eplb0_mtp1_2151.yaml" decode: num-worker: 1 tp: 8 @@ -4449,14 +4047,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [1229] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1_1229.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1_1229.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1_1229.yaml" decode: num-worker: 1 tp: 16 @@ -4464,14 +4060,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [615] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep32_batch16_eplb0_mtp3_615.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen1_dep32_batch16_eplb0_mtp3_615.yaml - - 
"CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen1_dep32_batch16_eplb0_mtp3_615.yaml" decode: num-worker: 1 tp: 32 @@ -4479,14 +4073,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [36] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3_36.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3_36.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3_36.yaml" decode: num-worker: 3 tp: 8 @@ -4494,14 +4086,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [18] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_18.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_18.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_18.yaml" decode: num-worker: 3 tp: 8 @@ -4509,14 +4099,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [9] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_9.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_9.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_9.yaml" decode: num-worker: 3 tp: 8 @@ -4524,98 +4112,84 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: false # 1k1k STP configs - conc-list: [6144] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch768_eplb0_mtp0_6144.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep8_batch768_eplb0_mtp0_6144.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep8_batch768_eplb0_mtp0_6144.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [4301] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4301.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4301.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4301.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [2151] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep16_batch128_eplb0_mtp0_2151.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep16_batch128_eplb0_mtp0_2151.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep16_batch128_eplb0_mtp0_2151.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [1127] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep32_batch32_eplb0_mtp0_1127.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep32_batch32_eplb0_mtp0_1127.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep32_batch32_eplb0_mtp0_1127.yaml" decode: num-worker: 1 tp: 32 ep: 32 dp-attn: true - conc-list: [256] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep32_batch8_eplb0_mtp0_256.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep32_batch8_eplb0_mtp0_256.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep32_batch8_eplb0_mtp0_256.yaml" decode: num-worker: 1 tp: 32 ep: 32 dp-attn: true - conc-list: [27] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen3_tep8_batch8_eplb0_mtp0_27.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen3_tep8_batch8_eplb0_mtp0_27.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen3_tep8_batch8_eplb0_mtp0_27.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: false - conc-list: [3] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_3.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_3.yaml" decode: num-worker: 3 tp: 8 @@ -4627,14 +4201,12 @@ dsr1-fp8-gb200-dynamo-trt: search-space: - spec-decoding: "mtp" conc-list: [666] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3_666.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3_666.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3_666.yaml" decode: num-worker: 1 tp: 8 @@ -4642,14 +4214,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [666] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx5_gen1_dep16_batch32_eplb0_mtp3_666.yaml" prefill: num-worker: 5 tp: 8 ep: 8 dp-attn: true - additional-settings: - 
# https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/mtp/ctx5_gen1_dep16_batch32_eplb0_mtp3_666.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/mtp/ctx5_gen1_dep16_batch32_eplb0_mtp3_666.yaml" decode: num-worker: 1 tp: 16 @@ -4657,14 +4227,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [333] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx3_gen1_dep16_batch16_eplb0_mtp3_333.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/mtp/ctx3_gen1_dep16_batch16_eplb0_mtp3_333.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/mtp/ctx3_gen1_dep16_batch16_eplb0_mtp3_333.yaml" decode: num-worker: 1 tp: 16 @@ -4672,14 +4240,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [333] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx4_gen1_dep32_batch8_eplb0_mtp3_333.yaml" prefill: num-worker: 4 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/mtp/ctx4_gen1_dep32_batch8_eplb0_mtp3_333.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/mtp/ctx4_gen1_dep32_batch8_eplb0_mtp3_333.yaml" decode: num-worker: 1 tp: 32 @@ -4687,14 +4253,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [90] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx2_gen1_dep32_batch2_eplb0_mtp3_90.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/mtp/ctx2_gen1_dep32_batch2_eplb0_mtp3_90.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/mtp/ctx2_gen1_dep32_batch2_eplb0_mtp3_90.yaml" decode: num-worker: 1 tp: 32 @@ -4702,14 +4266,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [15] + recipe: 
"dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_15.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_15.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_15.yaml" decode: num-worker: 3 tp: 8 @@ -4717,14 +4279,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [6] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_6.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_6.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_6.yaml" decode: num-worker: 3 tp: 8 @@ -4732,98 +4292,84 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: false # 8k1k STP configs - conc-list: [1229] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx5_gen1_dep16_batch64_eplb0_mtp0_1229.yaml" prefill: num-worker: 5 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/stp/ctx5_gen1_dep16_batch64_eplb0_mtp0_1229.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/stp/ctx5_gen1_dep16_batch64_eplb0_mtp0_1229.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [666] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx4_gen1_dep32_batch16_eplb0_mtp0_666.yaml" prefill: num-worker: 4 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/stp/ctx4_gen1_dep32_batch16_eplb0_mtp0_666.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/stp/ctx4_gen1_dep32_batch16_eplb0_mtp0_666.yaml" decode: num-worker: 1 tp: 32 ep: 32 dp-attn: true - conc-list: 
[615] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx3_gen1_dep16_batch32_eplb0_mtp0_615.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/stp/ctx3_gen1_dep16_batch32_eplb0_mtp0_615.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/stp/ctx3_gen1_dep16_batch32_eplb0_mtp0_615.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [333] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0_333.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0_333.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0_333.yaml" decode: num-worker: 1 tp: 32 ep: 32 dp-attn: true - conc-list: [63] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0_63.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0_63.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0_63.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: false - conc-list: [18] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch4_eplb0_mtp0_18.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/stp/ctx1_gen3_tep8_batch4_eplb0_mtp0_18.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/stp/ctx1_gen3_tep8_batch4_eplb0_mtp0_18.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: false - conc-list: [6] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_6.yaml" prefill: num-worker: 1 
tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_6.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_6.yaml" decode: num-worker: 3 tp: 8 @@ -4846,14 +4392,12 @@ dsr1-fp8-gb200-dynamo-sglang: search-space: # "Low latency" (1 prefill worker at TP4 and 1 decode worker at TP4) - conc-list: [4, 8] + recipe: "dsr1/sglang/gb200-fp8/1k1k/disagg/stp/low-latency.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb200-fp8/1k1k/low-latency.yaml - - "CONFIG_FILE=recipes/gb200-fp8/1k1k/low-latency.yaml" decode: num-worker: 1 tp: 4 @@ -4862,14 +4406,12 @@ dsr1-fp8-gb200-dynamo-sglang: # "Mid curve" (3 prefill workers at DEP8 and 1 decode worker at DEP48) - conc-list: [1024, 2048, 4096] + recipe: "dsr1/sglang/gb200-fp8/1k1k/disagg/stp/mid-curve.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb200-fp8/1k1k/mid-curve.yaml - - "CONFIG_FILE=recipes/gb200-fp8/1k1k/mid-curve.yaml" decode: num-worker: 1 tp: 48 @@ -4878,14 +4420,12 @@ dsr1-fp8-gb200-dynamo-sglang: # "Max throughput" (2 prefill workers at DEP8 and 1 decode worker at DEP32) - conc-list: [1024, 2048, 4096, 6144] + recipe: "dsr1/sglang/gb200-fp8/1k1k/disagg/stp/max-tpt.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb200-fp8/1k1k/max-tpt.yaml - - "CONFIG_FILE=recipes/gb200-fp8/1k1k/max-tpt.yaml" decode: num-worker: 1 tp: 32 @@ -4894,14 +4434,12 @@ dsr1-fp8-gb200-dynamo-sglang: # "Ultra throughput" (1 prefill workers at DEP8 and 1 decode worker at DEP8) - conc-list: [4096] + recipe: "dsr1/sglang/gb200-fp8/1k1k/disagg/stp/ultra-tpt.yaml" prefill: 
num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb200-fp8/1k1k/ultra-tpt.yaml - - "CONFIG_FILE=recipes/gb200-fp8/1k1k/ultra-tpt.yaml" decode: num-worker: 1 tp: 8 @@ -4913,14 +4451,12 @@ dsr1-fp8-gb200-dynamo-sglang: search-space: # "Low latency" (1 prefill worker at TP8 and 1 decode worker at TP8) - conc-list: [4, 8, 16] + recipe: "dsr1/sglang/gb200-fp8/8k1k/disagg/stp/low-latency.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb200-fp8/8k1k/low-latency.yaml - - "CONFIG_FILE=recipes/gb200-fp8/8k1k/low-latency.yaml" decode: num-worker: 1 tp: 8 @@ -4929,14 +4465,12 @@ dsr1-fp8-gb200-dynamo-sglang: # "Mid curve" (5 prefill workers at DEP8 and 1 decode worker at DEP32) - conc-list: [512, 1024, 2048, 6144] + recipe: "dsr1/sglang/gb200-fp8/8k1k/disagg/stp/mid-curve.yaml" prefill: num-worker: 5 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb200-fp8/8k1k/mid-curve.yaml - - "CONFIG_FILE=recipes/gb200-fp8/8k1k/mid-curve.yaml" decode: num-worker: 1 tp: 32 @@ -4945,14 +4479,12 @@ dsr1-fp8-gb200-dynamo-sglang: # "Max throughput" (6 prefill workers at DEP8 and 1 decode worker at DEP24) - conc-list: [2048, 4096, 6144] + recipe: "dsr1/sglang/gb200-fp8/8k1k/disagg/stp/max_tpt.yaml" prefill: num-worker: 6 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb200-fp8/8k1k/max_tpt.yaml - - "CONFIG_FILE=recipes/gb200-fp8/8k1k/max_tpt.yaml" decode: num-worker: 1 tp: 24 @@ -4974,14 +4506,12 @@ dsr1-fp8-gb300-dynamo-sglang: search-space: # "Low latency" (1 prefill worker at TP4 and 4 decode workers at TP4) - conc-list: [4, 8, 16, 32] + recipe: "dsr1/sglang/gb300-fp8/1k1k/disagg/stp/low-latency.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: 
false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb300-fp8/1k1k/stp/low-latency.yaml - - "CONFIG_FILE=recipes/gb300-fp8/1k1k/stp/low-latency.yaml" decode: num-worker: 4 tp: 4 @@ -4990,14 +4520,12 @@ dsr1-fp8-gb300-dynamo-sglang: # "Mid curve" (2 prefill workers at DEP8 and 1 decode worker at DEP32) - conc-list: [1024, 2048, 4096, 6144] + recipe: "dsr1/sglang/gb300-fp8/1k1k/disagg/stp/mid.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb300-fp8/1k1k/stp/mid.yaml - - "CONFIG_FILE=recipes/gb300-fp8/1k1k/stp/mid.yaml" decode: num-worker: 1 tp: 32 @@ -5006,14 +4534,12 @@ dsr1-fp8-gb300-dynamo-sglang: # "Max throughput" (1 prefill worker at DEP8 and 1 decode worker at DEP8) - conc-list: [4096, 7168, 7680] + recipe: "dsr1/sglang/gb300-fp8/1k1k/disagg/stp/max.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb300-fp8/1k1k/stp/max.yaml - - "CONFIG_FILE=recipes/gb300-fp8/1k1k/stp/max.yaml" decode: num-worker: 1 tp: 8 @@ -5025,14 +4551,12 @@ dsr1-fp8-gb300-dynamo-sglang: search-space: # "Low latency" (1 prefill worker at TP4 and 1 decode worker at TP4) - conc-list: [4, 8] + recipe: "dsr1/sglang/gb300-fp8/8k1k/disagg/stp/low-latency.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb300-fp8/8k1k/stp/low-latency.yaml - - "CONFIG_FILE=recipes/gb300-fp8/8k1k/stp/low-latency.yaml" decode: num-worker: 1 tp: 4 @@ -5041,14 +4565,12 @@ dsr1-fp8-gb300-dynamo-sglang: # "Mid curve" (5 prefill workers at DEP8 and 1 decode worker at DEP32) - conc-list: [128, 256, 512, 1024] + recipe: "dsr1/sglang/gb300-fp8/8k1k/disagg/stp/mid.yaml" prefill: num-worker: 5 tp: 8 ep: 8 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb300-fp8/8k1k/stp/mid.yaml - - "CONFIG_FILE=recipes/gb300-fp8/8k1k/stp/mid.yaml" decode: num-worker: 1 tp: 32 @@ -5057,14 +4579,12 @@ dsr1-fp8-gb300-dynamo-sglang: # "Max throughput" (6 prefill workers at DEP8 and 1 decode worker at DEP24) - conc-list: [2048, 4096] + recipe: "dsr1/sglang/gb300-fp8/8k1k/disagg/stp/max.yaml" prefill: num-worker: 6 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb300-fp8/8k1k/stp/max.yaml - - "CONFIG_FILE=recipes/gb300-fp8/8k1k/stp/max.yaml" decode: num-worker: 1 tp: 24 @@ -5088,13 +4608,12 @@ dsr1-fp4-gb200-dynamo-sglang: # Low latency (1 prefill node, 2 decode nodes) - spec-decoding: "none" conc-list: [ 4, 8, 32 ] + recipe: "dsr1/sglang/gb200-fp4/1k1k/disagg/stp/low-latency.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/gb200-fp4/1k1k/low-latency.yaml" decode: num-worker: 2 tp: 4 @@ -5104,13 +4623,12 @@ dsr1-fp4-gb200-dynamo-sglang: # Mid curve (4 prefill nodes, 8 decode nodes) - spec-decoding: "none" conc-list: [ 512, 2048, 4096, 8192 ] + recipe: "dsr1/sglang/gb200-fp4/1k1k/disagg/stp/mid-curve.yaml" prefill: num-worker: 4 tp: 4 ep: 4 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/gb200-fp4/1k1k/mid-curve.yaml" decode: num-worker: 1 tp: 32 @@ -5120,13 +4638,12 @@ dsr1-fp4-gb200-dynamo-sglang: # Max throughput (4 prefill nodes, 12 decode nodes) - spec-decoding: "none" conc-list: [ 2048, 4096 ] + recipe: "dsr1/sglang/gb200-fp4/1k1k/disagg/stp/max-tpt.yaml" prefill: num-worker: 4 tp: 4 ep: 4 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/gb200-fp4/1k1k/max-tpt.yaml" decode: num-worker: 1 tp: 48 @@ -5140,13 +4657,12 @@ dsr1-fp4-gb200-dynamo-sglang: # Low latency (1 prefill node, 4 decode nodes) - spec-decoding: "none" conc-list: [ 4, 8 ] + recipe: "dsr1/sglang/gb200-fp4/8k1k/disagg/stp/low-latency.yaml" 
prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/gb200-fp4/8k1k/low-latency.yaml" decode: num-worker: 4 tp: 4 @@ -5156,13 +4672,12 @@ dsr1-fp4-gb200-dynamo-sglang: # Mid curve (6 prefill nodes, 12 decode nodes) - spec-decoding: "none" conc-list: [ 512, 2048, 4096 ] + recipe: "dsr1/sglang/gb200-fp4/8k1k/disagg/stp/mid-curve.yaml" prefill: num-worker: 6 tp: 4 ep: 1 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/gb200-fp4/8k1k/mid-curve.yaml" decode: num-worker: 1 tp: 48 @@ -5172,13 +4687,12 @@ dsr1-fp4-gb200-dynamo-sglang: # Max throughput (10 prefill nodes, 8 decode nodes) - spec-decoding: "none" conc-list: [ 2048 ] + recipe: "dsr1/sglang/gb200-fp4/8k1k/disagg/stp/max-tpt.yaml" prefill: num-worker: 10 tp: 4 ep: 1 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/gb200-fp4/8k1k/max-tpt.yaml" decode: num-worker: 1 tp: 32 @@ -5201,14 +4715,12 @@ dsr1-fp4-gb300-dynamo-trt: # MTP configurations - spec-decoding: "mtp" conc-list: [3226] + recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep4_batch768_eplb0_mtp1.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/mtp/ctx1_gen1_dep4_batch768_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/mtp/ctx1_gen1_dep4_batch768_eplb0_mtp1.yaml" decode: num-worker: 1 tp: 4 @@ -5216,14 +4728,12 @@ dsr1-fp4-gb300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [333] + recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep32_batch8_eplb0_mtp.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/mtp/ctx1_gen1_dep32_batch8_eplb0_mtp.yaml - - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/mtp/ctx1_gen1_dep32_batch8_eplb0_mtp.yaml" decode: num-worker: 1 tp: 32 @@ -5231,14 +4741,12 @@ 
dsr1-fp4-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [5]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml"
     decode:
       num-worker: 4
       tp: 8
@@ -5246,14 +4754,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [8, 12, 24, 48]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml"
     decode:
       num-worker: 4
       tp: 8
@@ -5261,14 +4767,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [2253]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx3_gen1_dep16_batch128_eplb256_mtp1.yaml"
     prefill:
       num-worker: 3
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/mtp/ctx3_gen1_dep16_batch128_eplb256_mtp1.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/mtp/ctx3_gen1_dep16_batch128_eplb256_mtp1.yaml"
     decode:
       num-worker: 1
       tp: 16
@@ -5276,14 +4780,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [1229]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx3_gen1_dep32_batch32_eplb288_mtp3.yaml"
     prefill:
       num-worker: 3
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/mtp/ctx3_gen1_dep32_batch32_eplb288_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/mtp/ctx3_gen1_dep32_batch32_eplb288_mtp3.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5291,84 +4793,72 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: true
   # Non-MTP configurations (default spec_decoding="none")
   - conc-list: [5]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [12, 48, 96, 192]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [8192]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx2_gen1_dep8_batch1024_eplb0_mtp0.yaml"
     prefill:
       num-worker: 2
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/stp/ctx2_gen1_dep8_batch1024_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/stp/ctx2_gen1_dep8_batch1024_eplb0_mtp0.yaml"
     decode:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: true
   - conc-list: [1229]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0.yaml"
     prefill:
       num-worker: 2
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0.yaml"
     decode:
       num-worker: 1
       tp: 32
       ep: 32
       dp-attn: true
   - conc-list: [4301]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx3_gen1_dep16_batch256_eplb256_mtp0.yaml"
     prefill:
       num-worker: 3
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/stp/ctx3_gen1_dep16_batch256_eplb256_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/stp/ctx3_gen1_dep16_batch256_eplb256_mtp0.yaml"
     decode:
       num-worker: 1
       tp: 16
       ep: 16
       dp-attn: true
   - conc-list: [2253]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx3_gen1_dep32_batch64_eplb0_mtp0.yaml"
     prefill:
       num-worker: 3
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/stp/ctx3_gen1_dep32_batch64_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/stp/ctx3_gen1_dep32_batch64_eplb0_mtp0.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5380,14 +4870,12 @@ dsr1-fp4-gb300-dynamo-trt:
   # MTP configurations (spec_decoding="mtp")
   - spec-decoding: "mtp"
     conc-list: [33]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3.yaml"
     decode:
       num-worker: 3
       tp: 8
@@ -5395,14 +4883,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [5]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml"
     decode:
       num-worker: 4
       tp: 8
@@ -5410,14 +4896,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [12, 24]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml"
     decode:
       num-worker: 4
       tp: 8
@@ -5425,14 +4909,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [180]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx4_gen1_dep32_batch4_eplb0_mtp3.yaml"
     prefill:
       num-worker: 4
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/mtp/ctx4_gen1_dep32_batch4_eplb0_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/mtp/ctx4_gen1_dep32_batch4_eplb0_mtp3.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5440,14 +4922,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [308]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx8_gen1_dep32_batch8_eplb0_mtp3.yaml"
     prefill:
       num-worker: 8
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/mtp/ctx8_gen1_dep32_batch8_eplb0_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/mtp/ctx8_gen1_dep32_batch8_eplb0_mtp3.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5455,14 +4935,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [2253]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml"
     prefill:
       num-worker: 10
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml"
     decode:
       num-worker: 1
       tp: 8
@@ -5470,14 +4948,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [666]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep16_batch32_eplb0_mtp3.yaml"
     prefill:
       num-worker: 10
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/mtp/ctx10_gen1_dep16_batch32_eplb0_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/mtp/ctx10_gen1_dep16_batch32_eplb0_mtp3.yaml"
     decode:
       num-worker: 1
       tp: 16
@@ -5485,14 +4961,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [1127]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx13_gen1_dep16_batch64_eplb256_mtp3.yaml"
     prefill:
       num-worker: 13
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/mtp/ctx13_gen1_dep16_batch64_eplb256_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/mtp/ctx13_gen1_dep16_batch64_eplb256_mtp3.yaml"
     decode:
       num-worker: 1
       tp: 16
@@ -5500,112 +4974,96 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: true
   # Non-MTP configurations (default spec_decoding="none")
   - conc-list: [72]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml"
     decode:
       num-worker: 3
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [5]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [12]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch2_eplb0_mtp0.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/stp/ctx1_gen4_tep8_batch2_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/stp/ctx1_gen4_tep8_batch2_eplb0_mtp0.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [5, 15, 30]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml"
     decode:
       num-worker: 5
       tp: 4
       ep: 4
       dp-attn: false
   - conc-list: [666]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx7_gen1_dep32_batch16_eplb0_mtp0.yaml"
     prefill:
       num-worker: 7
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/stp/ctx7_gen1_dep32_batch16_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/stp/ctx7_gen1_dep32_batch16_eplb0_mtp0.yaml"
     decode:
       num-worker: 1
       tp: 32
       ep: 32
       dp-attn: true
   - conc-list: [1229]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx9_gen1_dep16_batch64_eplb0_mtp0.yaml"
     prefill:
       num-worker: 9
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/stp/ctx9_gen1_dep16_batch64_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/stp/ctx9_gen1_dep16_batch64_eplb0_mtp0.yaml"
     decode:
       num-worker: 1
       tp: 16
       ep: 16
       dp-attn: true
   - conc-list: [3228]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx11_gen3_dep4_batch256_eplb0_mtp0.yaml"
     prefill:
       num-worker: 11
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/stp/ctx11_gen3_dep4_batch256_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/stp/ctx11_gen3_dep4_batch256_eplb0_mtp0.yaml"
     decode:
       num-worker: 3
       tp: 4
       ep: 4
       dp-attn: true
   - conc-list: [2253]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx14_gen1_dep16_batch128_eplb0_mtp0.yaml"
     prefill:
       num-worker: 14
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/stp/ctx14_gen1_dep16_batch128_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/stp/ctx14_gen1_dep16_batch128_eplb0_mtp0.yaml"
     decode:
       num-worker: 1
       tp: 16
@@ -5629,13 +5087,12 @@ dsr1-fp4-gb300-dynamo-sglang:
   # Low latency (1 prefill node, 2 decode nodes)
   - spec-decoding: "none"
     conc-list: [ 4, 8, 32 ]
+    recipe: "dsr1/sglang/gb300-fp4/1k1k/disagg/stp/low_latency.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/gb300-fp4/1k1k/low_latency.yaml"
     decode:
       num-worker: 2
       tp: 4
@@ -5645,13 +5102,12 @@ dsr1-fp4-gb300-dynamo-sglang:
   # Mid curve (4 prefill nodes, 8 decode nodes)
   - spec-decoding: "none"
     conc-list: [ 512, 2048, 4096, 8192 ]
+    recipe: "dsr1/sglang/gb300-fp4/1k1k/disagg/stp/mid_curve.yaml"
     prefill:
       num-worker: 4
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/gb300-fp4/1k1k/mid_curve.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5661,13 +5117,12 @@ dsr1-fp4-gb300-dynamo-sglang:
   # Max throughput (4 prefill nodes, 12 decode nodes)
   - spec-decoding: "none"
     conc-list: [ 512, 2048, 4096, 8192 ]
+    recipe: "dsr1/sglang/gb300-fp4/1k1k/disagg/stp/max_tpt.yaml"
     prefill:
       num-worker: 4
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/gb300-fp4/1k1k/max_tpt.yaml"
     decode:
       num-worker: 1
       tp: 48
@@ -5681,13 +5136,12 @@ dsr1-fp4-gb300-dynamo-sglang:
   # Low latency (1 prefill node, 4 decode nodes)
   - spec-decoding: "none"
     conc-list: [ 4, 8, 32, 64 ]
+    recipe: "dsr1/sglang/gb300-fp4/8k1k/disagg/stp/low_latency.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/gb300-fp4/8k1k/low_latency.yaml"
     decode:
       num-worker: 4
       tp: 4
@@ -5697,13 +5151,12 @@ dsr1-fp4-gb300-dynamo-sglang:
   # Mid curve (6 prefill nodes, 12 decode nodes)
   - spec-decoding: "none"
     conc-list: [ 512, 2048, 4096 ]
+    recipe: "dsr1/sglang/gb300-fp4/8k1k/disagg/stp/mid_curve.yaml"
     prefill:
       num-worker: 6
       tp: 4
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/gb300-fp4/8k1k/mid_curve.yaml"
     decode:
       num-worker: 1
       tp: 48
@@ -5713,13 +5166,12 @@ dsr1-fp4-gb300-dynamo-sglang:
   # Max throughput (10 prefill nodes, 8 decode nodes)
   - spec-decoding: "none"
     conc-list: [ 2048 ]
+    recipe: "dsr1/sglang/gb300-fp4/8k1k/disagg/stp/max_tpt.yaml"
     prefill:
       num-worker: 10
       tp: 4
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/gb300-fp4/8k1k/max_tpt.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5742,14 +5194,12 @@ dsr1-fp8-gb300-dynamo-trt:
   # MTP configurations (spec_decoding="mtp")
   - spec-decoding: "mtp"
     conc-list: [8]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml"
     decode:
       num-worker: 4
       tp: 8
@@ -5757,14 +5207,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [24]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml"
     decode:
       num-worker: 4
       tp: 8
@@ -5772,14 +5220,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [180]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3_180.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3_180.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3_180.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5787,14 +5233,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [564]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx2_gen1_dep32_batch16_eplb0_mtp3_564.yaml"
     prefill:
       num-worker: 2
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/mtp/ctx2_gen1_dep32_batch16_eplb0_mtp3_564.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/mtp/ctx2_gen1_dep32_batch16_eplb0_mtp3_564.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5802,14 +5246,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [666]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp3_666.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp3_666.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp3_666.yaml"
     decode:
       num-worker: 1
       tp: 16
@@ -5817,14 +5259,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [2253]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx2_gen1_dep16_batch128_eplb0_mtp1_2253.yaml"
     prefill:
       num-worker: 2
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/mtp/ctx2_gen1_dep16_batch128_eplb0_mtp1_2253.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/mtp/ctx2_gen1_dep16_batch128_eplb0_mtp1_2253.yaml"
     decode:
       num-worker: 1
       tp: 16
@@ -5832,14 +5272,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [8192]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx3_gen2_dep8_batch512_eplb0_mtp1_8192.yaml"
     prefill:
       num-worker: 3
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/mtp/ctx3_gen2_dep8_batch512_eplb0_mtp1_8192.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/mtp/ctx3_gen2_dep8_batch512_eplb0_mtp1_8192.yaml"
     decode:
       num-worker: 2
       tp: 8
@@ -5847,98 +5285,84 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   # STP configurations (no spec_decoding)
   - conc-list: [4]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [24]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [84]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0_84.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0_84.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0_84.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [1229]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0_1229.yaml"
     prefill:
       num-worker: 2
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0_1229.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0_1229.yaml"
     decode:
       num-worker: 1
       tp: 32
       ep: 32
       dp-attn: true
   - conc-list: [2253]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx2_gen1_dep16_batch128_eplb0_mtp0_2253.yaml"
     prefill:
       num-worker: 2
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/stp/ctx2_gen1_dep16_batch128_eplb0_mtp0_2253.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/stp/ctx2_gen1_dep16_batch128_eplb0_mtp0_2253.yaml"
     decode:
       num-worker: 1
       tp: 16
       ep: 16
       dp-attn: true
   - conc-list: [8602]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx3_gen2_dep8_batch512_eplb0_mtp0_8602.yaml"
     prefill:
       num-worker: 3
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/stp/ctx3_gen2_dep8_batch512_eplb0_mtp0_8602.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/stp/ctx3_gen2_dep8_batch512_eplb0_mtp0_8602.yaml"
     decode:
       num-worker: 2
       tp: 8
       ep: 8
       dp-attn: true
   - conc-list: [12288]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx3_gen2_dep8_batch768_eplb0_mtp0_12288.yaml"
     prefill:
       num-worker: 3
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/stp/ctx3_gen2_dep8_batch768_eplb0_mtp0_12288.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/stp/ctx3_gen2_dep8_batch768_eplb0_mtp0_12288.yaml"
     decode:
       num-worker: 2
       tp: 8
@@ -5950,14 +5374,12 @@ dsr1-fp8-gb300-dynamo-trt:
   # MTP configurations (spec_decoding="mtp")
   - spec-decoding: "mtp"
     conc-list: [8]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml"
     decode:
       num-worker: 4
       tp: 8
@@ -5965,14 +5387,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [24]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml"
     decode:
       num-worker: 4
       tp: 8
@@ -5980,14 +5400,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [333]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx6_gen1_dep32_batch8_eplb0_mtp3_333.yaml"
     prefill:
       num-worker: 6
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/mtp/ctx6_gen1_dep32_batch8_eplb0_mtp3_333.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/mtp/ctx6_gen1_dep32_batch8_eplb0_mtp3_333.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5995,14 +5413,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [666]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx8_gen1_dep16_batch32_eplb0_mtp3_666.yaml"
     prefill:
       num-worker: 8
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/mtp/ctx8_gen1_dep16_batch32_eplb0_mtp3_666.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/mtp/ctx8_gen1_dep16_batch32_eplb0_mtp3_666.yaml"
     decode:
       num-worker: 1
       tp: 16
@@ -6010,14 +5426,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [1229]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx10_gen1_dep16_batch64_eplb0_mtp1_1229.yaml"
     prefill:
       num-worker: 10
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/mtp/ctx10_gen1_dep16_batch64_eplb0_mtp1_1229.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/mtp/ctx10_gen1_dep16_batch64_eplb0_mtp1_1229.yaml"
     decode:
       num-worker: 1
       tp: 16
@@ -6025,14 +5439,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [1229]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx7_gen1_dep8_batch128_eplb0_mtp1_1229.yaml"
     prefill:
       num-worker: 7
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/mtp/ctx7_gen1_dep8_batch128_eplb0_mtp1_1229.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/mtp/ctx7_gen1_dep8_batch128_eplb0_mtp1_1229.yaml"
     decode:
       num-worker: 1
       tp: 8
@@ -6040,98 +5452,84 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   # STP configurations (no spec_decoding)
   - conc-list: [4]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [24]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [36]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch8_eplb0_mtp0_36.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/stp/ctx1_gen4_tep8_batch8_eplb0_mtp0_36.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/stp/ctx1_gen4_tep8_batch8_eplb0_mtp0_36.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [512]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx6_gen1_dep32_batch16_eplb0_mtp0_512.yaml"
     prefill:
       num-worker: 6
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/stp/ctx6_gen1_dep32_batch16_eplb0_mtp0_512.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/stp/ctx6_gen1_dep32_batch16_eplb0_mtp0_512.yaml"
     decode:
       num-worker: 1
       tp: 32
       ep: 32
       dp-attn: true
   - conc-list: [666]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx4_gen1_dep16_batch32_eplb0_mtp0_666.yaml"
     prefill:
       num-worker: 4
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/stp/ctx4_gen1_dep16_batch32_eplb0_mtp0_666.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/stp/ctx4_gen1_dep16_batch32_eplb0_mtp0_666.yaml"
     decode:
       num-worker: 1
       tp: 16
       ep: 16
       dp-attn: true
   - conc-list: [1229]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx7_gen1_dep16_batch64_eplb0_mtp0_1229.yaml"
     prefill:
       num-worker: 7
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/stp/ctx7_gen1_dep16_batch64_eplb0_mtp0_1229.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/stp/ctx7_gen1_dep16_batch64_eplb0_mtp0_1229.yaml"
     decode:
       num-worker: 1
       tp: 16
       ep: 16
       dp-attn: true
   - conc-list: [2151]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx7_gen1_dep8_batch256_eplb0_mtp0_2151.yaml"
     prefill:
       num-worker: 7
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/stp/ctx7_gen1_dep8_batch256_eplb0_mtp0_2151.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/stp/ctx7_gen1_dep8_batch256_eplb0_mtp0_2151.yaml"
     decode:
       num-worker: 1
       tp: 8
@@ -6424,13 +5822,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # STP: Low latency (1 prefill, 9 decode, TEP)
   - spec-decoding: "none"
     conc-list: [1, 4, 8, 16, 32, 64, 128, 256]
+    recipe: "dsr1/sglang/h200-fp8/1k1k/disagg/stp/low-latency-1p9d.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/1k1k/low-latency-1p9d.yaml"
     decode:
       num-worker: 9
       tp: 8
@@ -6439,13 +5836,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # STP: High throughput TEP (1 prefill, 6 decode)
   - spec-decoding: "none"
     conc-list: [512, 1024, 2048]
+    recipe: "dsr1/sglang/h200-fp8/1k1k/disagg/stp/bs256-1p6d-tp.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/1k1k/bs256-1p6d-tp.yaml"
     decode:
       num-worker: 6
       tp: 8
@@ -6454,13 +5850,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # STP: High throughput DEP (1 prefill, 6 decode, dp-attention)
   - spec-decoding: "none"
     conc-list: [128, 256, 512, 1024, 2048]
+    recipe: "dsr1/sglang/h200-fp8/1k1k/disagg/stp/bs256-1p6d-dep.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/1k1k/bs256-1p6d-dep.yaml"
     decode:
       num-worker: 6
       tp: 8
@@ -6469,13 +5864,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # MTP: Low latency (1 prefill, 9 decode, TEP)
   - spec-decoding: "mtp"
     conc-list: [1, 4, 8, 16, 32, 64, 128, 256]
+    recipe: "dsr1/sglang/h200-fp8/1k1k/disagg/mtp/low-latency-1p9d-mtp.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/1k1k/low-latency-1p9d-mtp.yaml"
     decode:
       num-worker: 9
       tp: 8
@@ -6484,13 +5878,12 @@ dsr1-fp8-h200-dynamo-sglang:
  # MTP: High throughput TEP (1 prefill, 6 decode)
   - spec-decoding: "mtp"
     conc-list: [512, 1024, 2048]
+    recipe: "dsr1/sglang/h200-fp8/1k1k/disagg/mtp/bs256-1p6d-tp-mtp.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/1k1k/bs256-1p6d-tp-mtp.yaml"
     decode:
       num-worker: 6
       tp: 8
@@ -6499,13 +5892,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # MTP: High throughput DEP (1 prefill, 6 decode, dp-attention)
   - spec-decoding: "mtp"
     conc-list: [128, 256, 512, 1024, 2048]
+    recipe: "dsr1/sglang/h200-fp8/1k1k/disagg/mtp/bs256-1p6d-dep-mtp.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/1k1k/bs256-1p6d-dep-mtp.yaml"
     decode:
       num-worker: 6
       tp: 8
@@ -6517,13 +5909,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # STP: Low latency TEP (1 prefill, 7 decode)
   - spec-decoding: "none"
     conc-list: [1, 4, 8]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs4-1p7d.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs4-1p7d.yaml"
     decode:
       num-worker: 7
       tp: 8
@@ -6532,13 +5923,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # STP: TEP (1 prefill, 6 decode)
   - spec-decoding: "none"
     conc-list: [4, 8, 16]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs8-1p6d.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs8-1p6d.yaml"
     decode:
       num-worker: 6
       tp: 8
@@ -6547,13 +5937,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # STP: TEP (1 prefill, 3 decode)
   - spec-decoding: "none"
     conc-list: [8, 16, 32]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs16-1p3d.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs16-1p3d.yaml"
     decode:
       num-worker: 3
       tp: 8
@@ -6562,13 +5951,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # STP: TEP (2 prefill, 3 decode)
   - spec-decoding: "none"
     conc-list: [32, 64, 128]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs64-2p3d.yaml"
     prefill:
       num-worker: 2
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs64-2p3d.yaml"
     decode:
       num-worker: 3
       tp: 8
@@ -6577,13 +5965,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # STP: High throughput DEP (1 prefill, 1 decode, dp-attention)
   - spec-decoding: "none"
     conc-list: [64, 128, 256]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs128-1p1d-dep.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs128-1p1d-dep.yaml"
     decode:
       num-worker: 1
       tp: 8
@@ -6592,13 +5979,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # MTP: Low latency TEP (1 prefill, 7 decode)
   - spec-decoding: "mtp"
     conc-list: [1, 4, 8]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs4-1p7d-mtp.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs4-1p7d-mtp.yaml"
     decode:
       num-worker: 7
       tp: 8
@@ -6607,13 +5993,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # MTP: TEP (1 prefill, 6 decode)
   - spec-decoding: "mtp"
     conc-list: [2, 4, 8, 16, 32]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs8-1p6d-mtp.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs8-1p6d-mtp.yaml"
     decode:
       num-worker: 6
       tp: 8
@@ -6622,13 +6007,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # MTP: TEP (1 prefill, 3 decode)
   - spec-decoding: "mtp"
     conc-list: [4, 8, 16, 32, 64]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs16-1p3d-mtp.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs16-1p3d-mtp.yaml"
     decode:
       num-worker: 3
       tp: 8
@@ -6637,13 +6021,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # MTP: TEP (2 prefill, 3 decode)
   - spec-decoding: "mtp"
     conc-list: [32, 64, 128]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs64-2p3d-mtp.yaml"
     prefill:
       num-worker: 2
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs64-2p3d-mtp.yaml"
     decode:
       num-worker: 3
       tp: 8
@@ -6652,13 +6035,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # MTP: High throughput DEP (1 prefill, 1 decode, dp-attention)
   - spec-decoding: "mtp"
     conc-list: [32, 64, 128, 256, 512]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs128-1p1d-dep-mtp.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs128-1p1d-dep-mtp.yaml"
     decode:
       num-worker: 1
       tp: 8
@@ -6680,52 +6062,48 @@ dsr1-fp4-b200-dynamo-sglang:
   search-space:
   # Non-MTP configurations
   - conc-list: [16, 128]
+    recipe: "dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_stp_lowlat[0]"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_stp_lowlat[0]"
     decode:
       num-worker: 5
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [32, 64, 256]
+    recipe: "dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_stp_lowlat[1]"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_stp_lowlat[1]"
     decode:
       num-worker: 6
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [512]
+    recipe: "dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_stp_maxtpt[0]"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_stp_maxtpt[0]"
     decode:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: true
   - conc-list: [512]
+    recipe: "dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_stp_maxtpt[1]"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_stp_maxtpt[1]"
     decode:
       num-worker: 2
       tp: 8
@@ -6736,65 +6114,60 @@ dsr1-fp4-b200-dynamo-sglang:
   search-space:
   # Non-MTP configurations
   - conc-list: [64, 128]
+    recipe: "dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:zip_override_stp_lowlat[0]"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:zip_override_stp_lowlat[0]"
     decode:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [8]
+    recipe: "dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:zip_override_stp_lowlat[1]"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:zip_override_stp_lowlat[1]"
     decode:
       num-worker: 5
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [4, 128]
+    recipe: "dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:zip_override_stp_lowlat[2]"
     prefill:
       num-worker: 2
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:zip_override_stp_lowlat[2]"
     decode:
       num-worker: 5
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [4, 8, 16, 64]
+    recipe: "dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:override_stp_tp4"
     prefill:
       num-worker: 1
       tp: 4
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:override_stp_tp4"
     decode:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
   - conc-list: [1024, 2048]
+    recipe: "dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:override_stp_maxtpt_7p2d"
     prefill:
       num-worker: 7
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:override_stp_maxtpt_7p2d"
     decode:
       num-worker: 2
       tp: 8
@@ -6816,52 +6189,48 @@ dsr1-fp8-b200-dynamo-sglang:
   search-space:
   # Non-MTP configurations
   - conc-list: [4]
+    recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:zip_override_stp_lowlat[0]"
     prefill:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:zip_override_stp_lowlat[0]"
     decode:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [16, 32, 64, 128, 256]
+    recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:zip_override_stp_lowlat[1]"
     prefill:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:zip_override_stp_lowlat[1]"
     decode:
       num-worker: 3
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [1024, 2048, 4096]
+    recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:zip_override_stp_maxtpt[0]"
     prefill:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:zip_override_stp_maxtpt[0]"
     decode:
       num-worker: 5
       tp: 8
       ep: 8
       dp-attn: true
   - conc-list: [2048, 4096]
+    recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:zip_override_stp_maxtpt[1]"
     prefill:
       num-worker: 2
       tp: 8
       ep: 8
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:zip_override_stp_maxtpt[1]"
     decode:
       num-worker: 5
       tp: 8
@@ -6872,42 +6241,36 @@ dsr1-fp8-b200-dynamo-sglang:
   search-space:
   # STP low-latency: resolved from 8k1k.yaml zip_override_stp_lowlat
   - conc-list: [128]
+    recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_0.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_stp_lowlat_0.yaml
-    - "CONFIG_FILE=recipes/b200-fp8/8k1k_stp_lowlat_0.yaml"
     decode:
       num-worker: 3
       tp: 8
       ep: 1
       dp-attn: false
   - conc-list: [128]
+    recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_1.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_stp_lowlat_1.yaml
-    - "CONFIG_FILE=recipes/b200-fp8/8k1k_stp_lowlat_1.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 1
       dp-attn: false
   - conc-list: [8, 16, 32, 64, 128]
+    recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_2.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_stp_lowlat_2.yaml
-    - "CONFIG_FILE=recipes/b200-fp8/8k1k_stp_lowlat_2.yaml"
     decode:
       num-worker: 6
       tp: 8
@@ -6915,56 +6278,48 @@ dsr1-fp8-b200-dynamo-sglang:
       dp-attn: false
   # STP max-throughput: resolved from 8k1k.yaml zip_override_stp_maxtpt
   - conc-list: [288]
+    recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_0.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_stp_maxtpt_0.yaml
-    - "CONFIG_FILE=recipes/b200-fp8/8k1k_stp_maxtpt_0.yaml"
     decode:
       num-worker: 2
       tp: 8
       ep: 8
       dp-attn: true
   - conc-list: [160, 288]
+    recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_1.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_stp_maxtpt_1.yaml
-    - "CONFIG_FILE=recipes/b200-fp8/8k1k_stp_maxtpt_1.yaml"
     decode:
       num-worker: 1
       tp: 8
       ep: 8
dp-attn: true - conc-list: [512] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_2.yaml" prefill: num-worker: 2 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_stp_maxtpt_2.yaml - - "CONFIG_FILE=recipes/b200-fp8/8k1k_stp_maxtpt_2.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [1024] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_3.yaml" prefill: num-worker: 3 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_stp_maxtpt_3.yaml - - "CONFIG_FILE=recipes/b200-fp8/8k1k_stp_maxtpt_3.yaml" decode: num-worker: 1 tp: 8 @@ -6987,13 +6342,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: # MTP low-latency: 1P1D - spec-decoding: "mtp" conc-list: [4, 64] + recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:zip_override_mtp_lowlat[0]" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:zip_override_mtp_lowlat[0]" decode: num-worker: 1 tp: 8 @@ -7002,13 +6356,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: # MTP low-latency: 1P3D - spec-decoding: "mtp" conc-list: [4, 8, 16, 32, 128] + recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:zip_override_mtp_lowlat[1]" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:zip_override_mtp_lowlat[1]" decode: num-worker: 3 tp: 8 @@ -7017,13 +6370,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: # MTP max-tpt: 1P5D - spec-decoding: "mtp" conc-list: [512, 4096] + recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:zip_override_mtp_maxtpt[1]" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:zip_override_mtp_maxtpt[1]" decode: num-worker: 5 tp: 8 @@ -7032,13 +6384,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: # MTP max-tpt: 2P5D - spec-decoding: "mtp" 
conc-list: [1024, 2048, 4096] + recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:zip_override_mtp_maxtpt[2]" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:zip_override_mtp_maxtpt[2]" decode: num-worker: 5 tp: 8 @@ -7047,13 +6398,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: # MTP max-tpt: 1P2D - spec-decoding: "mtp" conc-list: [512, 1024, 2048] + recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:override_mtp_maxtpt_1p2d" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:override_mtp_maxtpt_1p2d" decode: num-worker: 2 tp: 8 @@ -7065,14 +6415,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: # MTP low-latency: resolved from 8k1k.yaml zip_override_mtp_lowlat - spec-decoding: "mtp" conc-list: [128] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_0.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_mtp_lowlat_0.yaml - - "CONFIG_FILE=recipes/b200-fp8/8k1k_mtp_lowlat_0.yaml" decode: num-worker: 3 tp: 8 @@ -7080,14 +6428,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: dp-attn: false - spec-decoding: "mtp" conc-list: [128] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_1.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_mtp_lowlat_1.yaml - - "CONFIG_FILE=recipes/b200-fp8/8k1k_mtp_lowlat_1.yaml" decode: num-worker: 4 tp: 8 @@ -7095,14 +6441,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: dp-attn: false - spec-decoding: "mtp" conc-list: [8, 16, 32, 64, 128] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_2.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_mtp_lowlat_2.yaml - - 
"CONFIG_FILE=recipes/b200-fp8/8k1k_mtp_lowlat_2.yaml" decode: num-worker: 6 tp: 8 @@ -7111,14 +6455,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: # MTP max-throughput: resolved from 8k1k.yaml zip_override_mtp_maxtpt - spec-decoding: "mtp" conc-list: [288] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_0.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_mtp_maxtpt_0.yaml - - "CONFIG_FILE=recipes/b200-fp8/8k1k_mtp_maxtpt_0.yaml" decode: num-worker: 2 tp: 8 @@ -7126,14 +6468,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: dp-attn: true - spec-decoding: "mtp" conc-list: [160, 288] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_1.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_mtp_maxtpt_1.yaml - - "CONFIG_FILE=recipes/b200-fp8/8k1k_mtp_maxtpt_1.yaml" decode: num-worker: 1 tp: 8 @@ -7141,14 +6481,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: dp-attn: true - spec-decoding: "mtp" conc-list: [512] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_2.yaml" prefill: num-worker: 2 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_mtp_maxtpt_2.yaml - - "CONFIG_FILE=recipes/b200-fp8/8k1k_mtp_maxtpt_2.yaml" decode: num-worker: 1 tp: 8 @@ -7156,14 +6494,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: dp-attn: true - spec-decoding: "mtp" conc-list: [1024] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_3.yaml" prefill: num-worker: 3 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_mtp_maxtpt_3.yaml - - "CONFIG_FILE=recipes/b200-fp8/8k1k_mtp_maxtpt_3.yaml" decode: num-worker: 1 tp: 8 @@ -7185,14 +6521,12 @@ dsr1-fp4-b200-dynamo-sglang-mtp: 
search-space: - spec-decoding: "mtp" conc-list: [16, 512] + recipe: "dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_mtp_lowlat[0]" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp4/1k1k.yaml - - "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_mtp_lowlat[0]" decode: num-worker: 5 tp: 8 @@ -7200,14 +6534,12 @@ dsr1-fp4-b200-dynamo-sglang-mtp: dp-attn: false - spec-decoding: "mtp" conc-list: [32, 64, 256, 512] + recipe: "dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_mtp_lowlat[1]" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp4/1k1k.yaml - - "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_mtp_lowlat[1]" decode: num-worker: 6 tp: 8 @@ -7215,14 +6547,12 @@ dsr1-fp4-b200-dynamo-sglang-mtp: dp-attn: false - spec-decoding: "mtp" conc-list: [512, 1024] + recipe: "dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_mtp_maxtpt[0]" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp4/1k1k.yaml - - "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_mtp_maxtpt[0]" decode: num-worker: 1 tp: 8 @@ -7230,14 +6560,12 @@ dsr1-fp4-b200-dynamo-sglang-mtp: dp-attn: true - spec-decoding: "mtp" conc-list: [512] + recipe: "dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_mtp_maxtpt[1]" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp4/1k1k.yaml - - "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_mtp_maxtpt[1]" decode: num-worker: 2 tp: 8 @@ -7251,14 +6579,12 @@ dsr1-fp4-b200-dynamo-sglang-mtp: search-space: - spec-decoding: "mtp" conc-list: [64, 128] + recipe: 
"dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:zip_override_mtp_lowlat[0]" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp4/8k1k.yaml - - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:zip_override_mtp_lowlat[0]" decode: num-worker: 1 tp: 8 @@ -7266,14 +6592,12 @@ dsr1-fp4-b200-dynamo-sglang-mtp: dp-attn: false - spec-decoding: "mtp" conc-list: [8] + recipe: "dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:zip_override_mtp_lowlat[1]" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp4/8k1k.yaml - - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:zip_override_mtp_lowlat[1]" decode: num-worker: 5 tp: 8 @@ -7281,14 +6605,12 @@ dsr1-fp4-b200-dynamo-sglang-mtp: dp-attn: false - spec-decoding: "mtp" conc-list: [4, 128] + recipe: "dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:zip_override_mtp_lowlat[2]" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp4/8k1k.yaml - - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:zip_override_mtp_lowlat[2]" decode: num-worker: 5 tp: 8 @@ -7296,14 +6618,12 @@ dsr1-fp4-b200-dynamo-sglang-mtp: dp-attn: false - spec-decoding: "mtp" conc-list: [4, 8, 16, 64] + recipe: "dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:override_mtp_tp4" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp4/8k1k.yaml - - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:override_mtp_tp4" decode: num-worker: 1 tp: 8 @@ -7325,98 +6645,84 @@ kimik2.5-fp4-gb200-dynamo-trt: search-space: # Non-MTP configurations (default spec_decoding="none") - conc-list: [ 4, 192, 360, 668 ] + recipe: "kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 
4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 8 ep: 8 dp-attn: false - conc-list: [ 5, 15, 30, 55 ] + recipe: "kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 4 ep: 4 dp-attn: false - conc-list: [ 666 ] + recipe: "kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [ 2253 ] + recipe: "kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml - - 
"CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 32 ep: 32 dp-attn: true - conc-list: [ 4301, 6452 ] + recipe: "kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [ 4301 ] + recipe: "kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [ 4301 ] + recipe: "kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 32 @@ -7428,98 +6734,84 @@ kimik2.5-fp4-gb200-dynamo-trt: search-space: # Non-MTP configurations (default spec_decoding="none") - conc-list: [ 4 ] + recipe: 
"kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 8 ep: 8 dp-attn: false - conc-list: [ 156 ] + recipe: "kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 4 ep: 4 dp-attn: false - conc-list: [ 5, 15, 30, 60, 105 ] + recipe: "kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 4 ep: 4 dp-attn: false - conc-list: [ 333 ] + recipe: "kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [ 615 ] + recipe: "kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml" prefill: num-worker: 3 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [ 2151 ] + recipe: "kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml" prefill: num-worker: 5 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [ 2253 ] + recipe: "kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml" prefill: num-worker: 7 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 16 @@ -7540,28 +6832,24 @@ 
kimik2.5-fp4-gb200-dynamo-vllm: osl: 1024 search-space: - conc-list: [256, 512, 1024, 2048, 3072, 4096] + recipe: "kimik2.5/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep4-dep16.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p1d-dep4-dep16.yaml - - "CONFIG_FILE=recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p1d-dep4-dep16.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [4, 8, 16, 32, 64, 128] + recipe: "kimik2.5/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p4d-dep4-tep4.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p4d-dep4-tep4.yaml - - "CONFIG_FILE=recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p4d-dep4-tep4.yaml" decode: num-worker: 4 tp: 4 @@ -7571,56 +6859,48 @@ kimik2.5-fp4-gb200-dynamo-vllm: osl: 1024 search-space: - conc-list: [4, 8, 16, 32, 128] + recipe: "kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-1p4d-dep4-tep4.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-1p4d-dep4-tep4.yaml - - "CONFIG_FILE=recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-1p4d-dep4-tep4.yaml" decode: num-worker: 4 tp: 4 ep: 4 dp-attn: false - conc-list: [512, 1024] + recipe: "kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep4-dep16.yaml" prefill: num-worker: 3 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-3p1d-dep4-dep16.yaml - - "CONFIG_FILE=recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-3p1d-dep4-dep16.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [2048] + recipe: 
"kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-5p1d-dep4-dep8.yaml" prefill: num-worker: 5 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-5p1d-dep4-dep8.yaml - - "CONFIG_FILE=recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-5p1d-dep4-dep8.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [3072, 4096] + recipe: "kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-6p1d-dep4-dep16.yaml" prefill: num-worker: 6 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-6p1d-dep4-dep16.yaml - - "CONFIG_FILE=recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-6p1d-dep4-dep16.yaml" decode: num-worker: 1 tp: 16 @@ -7647,13 +6927,12 @@ dsv4-fp4-gb200-dynamo-vllm: # Low latency: 1 prefill (DEP=8) + 1 decode (TP=8). 5 nodes total with # a dedicated NATS/etcd infra node. - conc-list: [1] + recipe: "dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-latency.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-latency.yaml" decode: num-worker: 1 tp: 8 @@ -7663,13 +6942,12 @@ dsv4-fp4-gb200-dynamo-vllm: # Low-middle curve: 1 prefill (DEP=8) + 4 decode (TP=8). 11 nodes total # with a dedicated NATS/etcd infra node. - conc-list: [256, 512] + recipe: "dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-middle-curve.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-middle-curve.yaml" decode: num-worker: 4 tp: 8 @@ -7679,13 +6957,12 @@ dsv4-fp4-gb200-dynamo-vllm: # Mid curve: 1 prefill (DEP=8) + 1 decode (DEP=8). 5 nodes total with # a dedicated NATS/etcd infra node. 
- conc-list: [256] + recipe: "dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-mid-curve.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-gb200-mid-curve.yaml" decode: num-worker: 1 tp: 8 @@ -7695,13 +6972,12 @@ dsv4-fp4-gb200-dynamo-vllm: # Max throughput: 3 prefill (DEP=8 each) + 1 decode (DEP=8). 9 nodes # total with a dedicated NATS/etcd infra node. - conc-list: [4096] + recipe: "dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-gb200-max-tpt.yaml" decode: num-worker: 1 tp: 8 @@ -7711,13 +6987,12 @@ dsv4-fp4-gb200-dynamo-vllm: # MegaMOE max throughput: same 3 prefill (DEP=8 each) + 1 decode (DEP=8) # shape, but uses deep_gemm_mega_moe on both workers and disables offload. - conc-list: [4096] + recipe: "dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt-megamoe.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-gb200-max-tpt-megamoe.yaml" decode: num-worker: 1 tp: 8 diff --git a/.github/workflows/benchmark-multinode-tmpl.yml b/.github/workflows/benchmark-multinode-tmpl.yml index 75036a986..a8005096b 100644 --- a/.github/workflows/benchmark-multinode-tmpl.yml +++ b/.github/workflows/benchmark-multinode-tmpl.yml @@ -77,6 +77,11 @@ on: required: false type: string default: "[]" + recipe: + description: "Path under benchmarks/multi_node/srt-slurm-recipes/ identifying the srt-slurm recipe to dispatch. May carry an `:override[N]` suffix. Empty for non-srt-slurm multi-node configs." 
+ required: false + type: string + default: "" run-eval: type: boolean required: false @@ -165,6 +170,7 @@ jobs: env: RUNNER_NAME: ${{ runner.name }} RUNNER_TYPE: ${{ inputs.runner }} + RECIPE: ${{ inputs.recipe }} # Hash uniquely on {EXP_NAME}_{PRECISION}_{FRAMEWORK}_prefill-tp{}-ep{}-dp{}-nw{}_decode-tp{}-ep{}-dp{}-nw{}_disagg-{}_spec-{}_conc{}_{runner} RESULT_FILENAME: ${{ env.EXP_NAME }}_${{ env.PRECISION }}_${{ env.FRAMEWORK }}_prefill-tp${{ env.PREFILL_TP }}-ep${{ env.PREFILL_EP }}-dp${{ env.PREFILL_DP_ATTN }}-nw${{ env.PREFILL_NUM_WORKERS }}_decode-tp${{ env.DECODE_TP }}-ep${{ env.DECODE_EP }}-dp${{ env.DECODE_DP_ATTN }}-nw${{ env.DECODE_NUM_WORKERS }}_disagg-${{ env.DISAGG }}_spec-${{ env.SPEC_DECODING }}_conc${{ join(fromJson(inputs.conc-list), 'x') }}_${{ runner.name }} run: | @@ -173,6 +179,15 @@ jobs: echo "RESULT_FILENAME=${RESULT_FILENAME}" >> $GITHUB_ENV export ${{ join(fromJson(inputs.prefill-additional-settings), ' ') }} ${{ join(fromJson(inputs.decode-additional-settings), ' ') }} + # RECIPE = "<path>[:override[N]]" relative to benchmarks/multi_node/srt-slurm-recipes/. + # Copy the file to scratch so the launcher's `sed -i` rewrites don't mutate the + # tracked recipe between concurrent runs; preserve any :override suffix verbatim.
+ if [[ -n "$RECIPE" ]]; then + src="${GITHUB_WORKSPACE}/benchmarks/multi_node/srt-slurm-recipes/${RECIPE%%:*}" + scratch="$(mktemp -d)/$(basename "${RECIPE%%:*}")" + cp "$src" "$scratch" + export CONFIG_FILE="${scratch}${RECIPE#"${RECIPE%%:*}"}" + fi export IS_MULTINODE=true bash ./runners/launch_${RUNNER_NAME%%_*}.sh if [ "${{ inputs.eval-only }}" = "true" ]; then diff --git a/.github/workflows/e2e-tests.yml b/.github/workflows/e2e-tests.yml index 74d4889f3..f8961f7b4 100644 --- a/.github/workflows/e2e-tests.yml +++ b/.github/workflows/e2e-tests.yml @@ -102,6 +102,7 @@ jobs: decode-ep: ${{ matrix.config.decode.ep }} decode-dp-attn: ${{ matrix.config.decode.dp-attn }} decode-additional-settings: ${{ toJson(matrix.config.decode.additional-settings) }} + recipe: ${{ matrix.config.recipe }} run-eval: false ref: ${{ inputs.ref }} @@ -141,6 +142,7 @@ jobs: decode-ep: ${{ matrix.config.decode.ep }} decode-dp-attn: ${{ matrix.config.decode.dp-attn }} decode-additional-settings: ${{ toJson(matrix.config.decode.additional-settings) }} + recipe: ${{ matrix.config.recipe }} run-eval: true eval-only: true eval-conc: ${{ matrix.config.eval-conc }} diff --git a/.github/workflows/run-sweep.yml b/.github/workflows/run-sweep.yml index fd1fa91be..4dea7065a 100644 --- a/.github/workflows/run-sweep.yml +++ b/.github/workflows/run-sweep.yml @@ -138,6 +138,7 @@ jobs: decode-ep: ${{ matrix.config.decode.ep }} decode-dp-attn: ${{ matrix.config.decode.dp-attn }} decode-additional-settings: ${{ toJson(matrix.config.decode.additional-settings) }} + recipe: ${{ matrix.config.recipe }} run-eval: false sweep-multi-node-8k1k: @@ -257,6 +258,7 @@ jobs: decode-ep: ${{ matrix.config.decode.ep }} decode-dp-attn: ${{ matrix.config.decode.dp-attn }} decode-additional-settings: ${{ toJson(matrix.config.decode.additional-settings) }} + recipe: ${{ matrix.config.recipe }} run-eval: true eval-only: true eval-conc: ${{ matrix.config.eval-conc }} diff --git a/benchmarks/benchmark_lib.sh 
b/benchmarks/benchmark_lib.sh index 268745735..e1d94b1a6 100644 --- a/benchmarks/benchmark_lib.sh +++ b/benchmarks/benchmark_lib.sh @@ -206,6 +206,13 @@ run_benchmark_serving() { local dsv4=false local trust_remote_code=false local server_pid="" + # Optional --tokenizer / --endpoint pass-throughs for the multi-node + # srt_bench.sh. --tokenizer points the bench at the /model auto-mount + # (avoids relying on --model being a HF-resolvable id). --endpoint lets + # recipes target /v1/chat/completions when chat-template-only request + # paths are required. + local tokenizer="" + local endpoint="" while [[ $# -gt 0 ]]; do case $1 in @@ -270,6 +277,14 @@ run_benchmark_serving() { server_pid="$2" shift 2 ;; + --tokenizer) + tokenizer="$2" + shift 2 + ;; + --endpoint) + endpoint="$2" + shift 2 + ;; *) echo "Unknown parameter: $1" return 1 @@ -356,7 +371,15 @@ run_benchmark_serving() { --result-dir "$result_dir" --result-filename "$result_filename.json" ) - + + # Optional pass-throughs. + if [[ -n "$tokenizer" ]]; then + benchmark_cmd+=(--tokenizer "$tokenizer") + fi + if [[ -n "$endpoint" ]]; then + benchmark_cmd+=(--endpoint "$endpoint") + fi + # Add --use-chat-template if requested if [[ "$use_chat_template" == true ]]; then benchmark_cmd+=(--use-chat-template) @@ -862,3 +885,72 @@ run_eval() { fi return $eval_rc } + +# -------------------------------- +# Container helpers +# -------------------------------- + +# Sanitize a container image reference (e.g. "lmsysorg/sglang:v0.5.8-cu130") +# into a filename-safe slug by replacing /, :, @, # with the chosen separator. +# Defaults to '_' (most clusters); pass '+' for clusters that adopted that +# convention for their squash-file directory. +sanitize_image_filename() { + local image="$1" + local sep="${2:-_}" + echo "$image" | sed "s|[/:@#]|${sep}|g" +} + +# -------------------------------- +# srt-slurm helpers +# -------------------------------- + +# Clone srt-slurm and install `srtctl` into a uv venv. 
After this returns +# successfully, cwd is the cloned repo and the venv is active. Idempotent on +# uv: skips re-curl if the binary is already present at $UV_INSTALL_DIR. +# +# The srt-slurm commit is pinned (not env-var overridable) so every benchmark +# run uses the exact same srtctl. To bump it, edit the `ref=` line below. +# +# All other inputs are env vars (set before calling); all are optional: +# SRT_REPO_DIR default srt-slurm (relative to current cwd) +# UV_INSTALL_DIR default $HOME/.local/bin (uv's own default) +# UV_VENV_DIR default .venv (inside the cloned repo) +clone_and_install_srtctl() { + local repo_url="https://github.com/NVIDIA/srt-slurm.git" + # Pinned to NVIDIA/srt-slurm@main — currently 1372a10. Includes: + # * #110 nginx-rework-ulimit: gates `ulimit -n 1048576` + worker_rlimit_nofile + # behind opt-in `frontend.nginx_raise_ulimit` (we don't opt in). + # * #111 srun command line log demoted INFO -> DEBUG (5KB fingerprint + # heredoc no longer dominates orchestrator log). + local ref="1372a10c493e3fd757f342d8516a5a91c30fe6ce" + local repo_dir="${SRT_REPO_DIR:-srt-slurm}" + local uv_install_dir="${UV_INSTALL_DIR:-${HOME}/.local/bin}" + local uv_venv_dir="${UV_VENV_DIR:-.venv}" + + echo "Cloning ${repo_url}@${ref} into ${repo_dir}..." + rm -rf "$repo_dir" + git clone "$repo_url" "$repo_dir" + cd "$repo_dir" || return 1 + git checkout "$ref" + + echo "Installing uv + srtctl into venv at ${uv_venv_dir}..." + export UV_INSTALL_DIR="$uv_install_dir" + mkdir -p "$uv_install_dir" + if ! [ -x "$uv_install_dir/uv" ]; then + curl -LsSf https://astral.sh/uv/install.sh | sh + fi + export PATH="$uv_install_dir:$PATH" + # uv's installer drops an `env` script next to the binary; source it so + # PATH/PS1 changes pick up in shells that don't re-read the env. + [ -f "$uv_install_dir/env" ] && source "$uv_install_dir/env" + + uv venv "$uv_venv_dir" + # shellcheck disable=SC1091 + source "$uv_venv_dir/bin/activate" + uv pip install -e . + + if ! 
command -v srtctl &> /dev/null; then + echo "Error: Failed to install srtctl" >&2 + return 1 + fi +} diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml new file mode 100644 index 000000000..b08193bcb --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml @@ -0,0 +1,259 @@ +# B200-FP4 1k1k — STP and MTP in one file +# +# Two inference modes distinguished by override key names: +# zip_override_stp_* — standard token prediction (no speculative decoding) +# zip_override_mtp_* — multi-token prediction (EAGLE speculative decoding) +# +# Low-latency variants: tep8 decode (DP=1), dep4 prefill (DP=4 TP=4) +# Max-throughput variants: dep8 decode (DP=8), adds SGLANG_MOE_NVFP4_DISPATCH +# +# Note: max-tpt 1d has max-running-requests=1024; max-tpt 2d keeps 512. +# MTP max-tpt 1d additionally uses mem-fraction=0.75 for decode. 
+# +# Usage: +# srtctl apply -f recipes/b200-fp4/1k1k.yaml # all 8 variants +# srtctl apply -f recipes/b200-fp4/1k1k.yaml:*stp* # all STP variants +# srtctl apply -f recipes/b200-fp4/1k1k.yaml:*mtp* # all MTP variants +# srtctl apply -f recipes/b200-fp4/1k1k.yaml:zip_override_stp_lowlat[0] # STP 1p5d only +# srtctl dry-run -f recipes/b200-fp4/1k1k.yaml # preview + +base: + name: "b200-fp4-stp-1k1k" + + model: + path: "dsr1" + container: "dynamo-sglang" + precision: "fp4" + + resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + decode_nodes: 5 + decode_workers: 5 + gpus_per_node: 8 + + backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + DYN_REQUEST_PLANE: nats + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + DYN_REQUEST_PLANE: nats + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + 
trust-remote-code: true + quantization: "modelopt_fp4" + + # Disaggregation mode + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.85 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + context-length: 2200 + max-running-requests: 512 + disable-cuda-graph: true + + # Parallelism + tensor-parallel-size: 4 + data-parallel-size: 4 + expert-parallel-size: 4 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # MoE + moe-runner-backend: "flashinfer_trtllm" + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + quantization: "modelopt_fp4" + + # Disaggregation mode + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.85 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + context-length: 2200 + max-running-requests: 512 + cuda-graph-max-bs: 512 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 8 + + # Attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # MoE + moe-runner-backend: "flashinfer_trtllm" + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + health_check: + max_attempts: 360 + interval_seconds: 10 + + benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + req_rate: "inf" + + +# STP low-latency: tep8 decode (DP=1), scale sweep 1p5d and 1p6d +zip_override_stp_lowlat: + name: + - "b200-fp4-stp-low-latency-dep4-1p-tep8-5d" + - "b200-fp4-stp-low-latency-dep4-1p-tep8-6d" + resources: + decode_nodes: [5, 6] + decode_workers: [5, 6] + benchmark: + 
concurrencies: ["16x128", "32x64x256"] + + +# MTP low-latency: same scales as STP, adds EAGLE speculative decoding + fp4-gemm-backend +zip_override_mtp_lowlat: + name: + - "b200-fp4-mtp-low-latency-dep4-1p-tep8-5d" + - "b200-fp4-mtp-low-latency-dep4-1p-tep8-6d" + resources: + decode_nodes: [5, 6] + decode_workers: [5, 6] + backend: + prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + prefill: + fp4-gemm-backend: "flashinfer_trtllm" + decode: + fp4-gemm-backend: "flashinfer_trtllm" + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + benchmark: + concurrencies: ["16x512", "32x64x256x512"] + + +# STP max-throughput: dep8 decode (DP=8), scale sweep 1p1d and 1p2d +# Adds SGLANG_MOE_NVFP4_DISPATCH + SGLANG_FLASHINFER_FP4_GEMM_BACKEND env vars +# 1d: max-running-requests=1024; 2d: keeps 512 +zip_override_stp_maxtpt: + name: + - "b200-fp4-stp-max-tpt-dep4-1p-dep8-1d" + - "b200-fp4-stp-max-tpt-dep4-1p-dep8-2d" + resources: + decode_nodes: [1, 2] + decode_workers: [1, 2] + backend: + decode_environment: + SGLANG_MOE_NVFP4_DISPATCH: "1" + SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass" + sglang_config: + prefill: + max-running-requests: [1024, 512] + decode: + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + max-running-requests: [1024, 512] + cuda-graph-max-bs: [1024, 512] + benchmark: + concurrencies: ["512", "512"] + + +# MTP max-throughput: dep8 decode, scale sweep 1p1d and 1p2d, adds EAGLE speculative decoding +# Adds SGLANG_MOE_NVFP4_DISPATCH + SGLANG_FLASHINFER_FP4_GEMM_BACKEND + fp4-gemm-backend +# 1d: max-running-requests=1024, mem-fraction=0.75 for decode; 2d: keeps 512/0.85 +zip_override_mtp_maxtpt: + name: + - "b200-fp4-mtp-max-tpt-dep4-1p-dep8-1d" + - "b200-fp4-mtp-max-tpt-dep4-1p-dep8-2d" + resources: + decode_nodes: [1, 2] + decode_workers: [1, 2] + backend: + 
prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_MOE_NVFP4_DISPATCH: "1" + SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass" + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + prefill: + fp4-gemm-backend: "flashinfer_trtllm" + max-running-requests: [1024, 512] + decode: + fp4-gemm-backend: "flashinfer_trtllm" + mem-fraction-static: [0.75, 0.85] + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + max-running-requests: [1024, 512] + cuda-graph-max-bs: [1024, 512] + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + benchmark: + concurrencies: ["512x1024", "512"] diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml new file mode 100644 index 000000000..f5bfc9641 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml @@ -0,0 +1,351 @@ +# B200-FP4 8k1k — STP and MTP in one file +# +# Three modes distinguished by override key names: +# override_stp_tp4 / override_mtp_tp4: TP4 prefill (DP=1, EP=1) — low-latency single-node +# zip_override_stp_lowlat / zip_override_mtp_lowlat: dep4 prefill + tep8 decode (DP=1) +# override_stp_maxtpt_7p2d / override_mtp_maxtpt_7p2d: dep4 prefill + dep8 decode, 7p2d +# override_mtp_maxtpt_4p1d: MTP-only 4p1d, no frontends, env-var FP4 backend +# +# Usage: +# srtctl apply -f recipes/b200-fp4/8k1k.yaml # all 11 variants +# srtctl apply -f recipes/b200-fp4/8k1k.yaml:*stp* # all STP variants +# srtctl apply -f recipes/b200-fp4/8k1k.yaml:*mtp* # all MTP variants +# srtctl apply -f recipes/b200-fp4/8k1k.yaml:override_stp_tp4 # STP tp4 only +# srtctl apply -f recipes/b200-fp4/8k1k.yaml:zip_override_stp_lowlat[0] # STP 1p1d only +# srtctl dry-run -f recipes/b200-fp4/8k1k.yaml # preview + +base: + name: "b200-fp4-stp-8k1k" + + 
dynamo: + version: 0.8.1 + + model: + path: "dsr1" + container: "dynamo-sglang" + precision: "fp4" + + frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 4 + + resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + + backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + DYN_REQUEST_PLANE: nats + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + DYN_REQUEST_PLANE: nats + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + quantization: "modelopt_fp4" + + # Disaggregation mode + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.85 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + context-length: 9600 + 
max-running-requests: 512 + disable-cuda-graph: true + + # Parallelism + tensor-parallel-size: 4 + data-parallel-size: 4 + expert-parallel-size: 4 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # MoE + moe-runner-backend: "flashinfer_trtllm" + moe-dense-tp-size: 1 + fp4-gemm-backend: "flashinfer_trtllm" + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + quantization: "modelopt_fp4" + + # Disaggregation mode + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.85 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + context-length: 9600 + max-running-requests: 512 + cuda-graph-max-bs: 512 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 8 + + # Attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # MoE + moe-runner-backend: "flashinfer_trtllm" + fp4-gemm-backend: "flashinfer_trtllm" + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + health_check: + max_attempts: 360 + interval_seconds: 10 + + benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + req_rate: "inf" + + +# STP TP4 prefill mode: TP4 (DP=1, EP=1) instead of dep4 — low-latency single-node +override_stp_tp4: + name: "b200-fp4-stp-low-latency-tp4-1p-tp8-1d" + frontend: + num_additional_frontends: 2 + backend: + sglang_config: + prefill: + data-parallel-size: 1 + expert-parallel-size: 1 + enable-dp-attention: null + enable-dp-lm-head: null + decode: + expert-parallel-size: 1 + benchmark: + concurrencies: "4x8x16x64" + + +# MTP TP4 prefill mode: same as STP tp4 but adds EAGLE speculative 
decoding +override_mtp_tp4: + name: "b200-fp4-mtp-low-latency-tp4-1p-tp8-1d" + frontend: + num_additional_frontends: 2 + backend: + prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + prefill: + data-parallel-size: 1 + expert-parallel-size: 1 + enable-dp-attention: null + enable-dp-lm-head: null + decode: + expert-parallel-size: 1 + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + benchmark: + concurrencies: "4x8x16x64" + + +# STP low-latency: dep4 prefill + tep8 decode (DP=1), scale sweep 1p1d/1p5d/2p5d +zip_override_stp_lowlat: + name: + - "b200-fp4-stp-low-latency-dep4-1p-tep8-1d" + - "b200-fp4-stp-low-latency-dep4-1p-tep8-5d" + - "b200-fp4-stp-low-latency-dep4-2p-tep8-5d" + resources: + prefill_nodes: [1, 1, 2] + prefill_workers: [1, 1, 2] + decode_nodes: [1, 5, 5] + decode_workers: [1, 5, 5] + benchmark: + concurrencies: ["64x128", "8", "4x128"] + + +# MTP low-latency: same scales as STP, adds EAGLE speculative decoding +zip_override_mtp_lowlat: + name: + - "b200-fp4-mtp-low-latency-dep4-1p-tep8-1d" + - "b200-fp4-mtp-low-latency-dep4-1p-tep8-5d" + - "b200-fp4-mtp-low-latency-dep4-2p-tep8-5d" + resources: + prefill_nodes: [1, 1, 2] + prefill_workers: [1, 1, 2] + decode_nodes: [1, 5, 5] + decode_workers: [1, 5, 5] + backend: + prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + decode: + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + benchmark: + concurrencies: ["64x128", "8", "4x128"] + + +# STP max-throughput 7p2d: dep4 prefill + dep8 decode, flashinfer_cutlass backend +override_stp_maxtpt_7p2d: + name: "b200-fp4-stp-max-tpt-dep4-7p-dep8-2d" + resources: + prefill_nodes: 7 + prefill_workers: 7 + decode_nodes: 2 + decode_workers: 2 + backend: + decode_environment: + 
SGLANG_MOE_NVFP4_DISPATCH: "1" + sglang_config: + prefill: + max-prefill-tokens: 65536 + chunked-prefill-size: 65536 + max-running-requests: 1024 + fp4-gemm-backend: "flashinfer_cutlass" + decode: + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + max-running-requests: 2048 + cuda-graph-max-bs: 1024 + fp4-gemm-backend: "flashinfer_cutlass" + benchmark: + concurrencies: "1024x2048" + + +# MTP max-throughput 7p2d: same as STP but adds EAGLE speculative decoding +override_mtp_maxtpt_7p2d: + name: "b200-fp4-mtp-max-tpt-dep4-7p-dep8-2d" + resources: + prefill_nodes: 7 + prefill_workers: 7 + decode_nodes: 2 + decode_workers: 2 + backend: + prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_MOE_NVFP4_DISPATCH: "1" + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + prefill: + max-prefill-tokens: 65536 + chunked-prefill-size: 65536 + max-running-requests: 1024 + fp4-gemm-backend: "flashinfer_cutlass" + decode: + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + max-running-requests: 2048 + cuda-graph-max-bs: 1024 + fp4-gemm-backend: "flashinfer_cutlass" + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + benchmark: + concurrencies: "1024x2048" + + +# MTP-only: 4p1d, no frontends, SGLANG_FLASHINFER_FP4_GEMM_BACKEND env var (fp4-gemm-backend: null +# removes the sglang_config key), mem-fraction=0.75 for decode +override_mtp_maxtpt_4p1d: + name: "b200-fp4-mtp-max-tpt-dep4-4p-dep8-1d" + dynamo: null + frontend: null + resources: + prefill_nodes: 4 + prefill_workers: 4 + decode_nodes: 1 + decode_workers: 1 + backend: + prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_MOE_NVFP4_DISPATCH: "1" + SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass" + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + prefill: + max-running-requests: 1024 + fp4-gemm-backend: null + 
decode: + mem-fraction-static: 0.75 + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + max-running-requests: 1024 + cuda-graph-max-bs: 1024 + fp4-gemm-backend: null + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + benchmark: + concurrencies: "1024" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml new file mode 100644 index 000000000..7489586aa --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml @@ -0,0 +1,281 @@ +# B200-FP8 1k1k — STP and MTP in one file +# +# Two inference modes distinguished by override key names: +# zip_override_stp_* — standard token prediction (no speculative decoding) +# zip_override_mtp_* — multi-token prediction (EAGLE speculative decoding) +# +# Low-latency variants: tep8 decode (DP=1) +# Max-throughput variants: dep8 decode (DP=8) +# +# Usage: +# srtctl apply -f recipes/b200-fp8/1k1k.yaml # all 10 variants +# srtctl apply -f recipes/b200-fp8/1k1k.yaml:*stp* # all STP variants +# srtctl apply -f recipes/b200-fp8/1k1k.yaml:*mtp* # all MTP variants +# srtctl apply -f recipes/b200-fp8/1k1k.yaml:zip_override_stp_lowlat[0] # STP 1p1d only +# srtctl dry-run -f recipes/b200-fp8/1k1k.yaml # preview + +base: + name: "b200-fp8-stp-1k1k" + + model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + + resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + + backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + 
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + DYN_REQUEST_PLANE: nats + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + DYN_REQUEST_PLANE: nats + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + quantization: "fp8" + + # Disaggregation mode + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.85 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + context-length: 2200 + max-running-requests: 512 + disable-cuda-graph: true + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 8 + + # Attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # MoE + moe-runner-backend: "flashinfer_trtllm" + # moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + quantization: "fp8" + + # Disaggregation mode + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.85 + 
max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + context-length: 2200 + max-running-requests: 512 + cuda-graph-max-bs: 512 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 8 + + # Attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # MoE + moe-runner-backend: "flashinfer_trtllm" + # moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + # disable-chunked-prefix-cache: true + + health_check: + max_attempts: 360 + interval_seconds: 10 + + benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + req_rate: "inf" + + +# STP low-latency: tep8 decode (DP=1), scale sweep 1p1d and 1p3d +zip_override_stp_lowlat: + name: + - "b200-fp8-stp-low-latency-tep8-1p-1d" + - "b200-fp8-stp-low-latency-tep8-1p-3d" + resources: + decode_nodes: [1, 3] + decode_workers: [1, 3] + benchmark: + concurrencies: ["4", "16x32x64x128x256"] + + +# MTP low-latency: same scales as STP, adds EAGLE speculative decoding +zip_override_mtp_lowlat: + name: + - "b200-fp8-mtp-low-latency-tep8-1p-1d" + - "b200-fp8-mtp-low-latency-tep8-1p-3d" + resources: + decode_nodes: [1, 3] + decode_workers: [1, 3] + backend: + prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + prefill: + moe-dense-tp-size: 1 + decode: + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + benchmark: + concurrencies: ["4x64", "4x8x16x32x128"] + + +# STP max-throughput: dep8 decode (DP=8), scale sweep 1p5d and 2p5d +zip_override_stp_maxtpt: + name: + - "b200-fp8-stp-max-tpt-dep8-1p-5d" + - "b200-fp8-stp-max-tpt-dep8-2p-5d" + resources: + prefill_nodes: [1, 2] + prefill_workers: [1, 2] + decode_nodes: [5, 5] + decode_workers: [5, 5] + backend: + sglang_config: + prefill: + data-parallel-size: 8 + enable-dp-attention: true + 
enable-dp-lm-head: true + moe-dense-tp-size: 1 + max-running-requests: 1024 + decode: + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + max-running-requests: 1024 + cuda-graph-max-bs: 1024 + benchmark: + concurrencies: ["1024", "2048"] + + +# MTP max-throughput: dep8 decode, scale sweep 1p1d/1p5d/2p5d, adds EAGLE speculative decoding +# Note: max-running-requests stays at 512 for MTP (unlike STP which raises to 1024) +zip_override_mtp_maxtpt: + name: + - "b200-fp8-mtp-max-tpt-dep8-1p-1d" + - "b200-fp8-mtp-max-tpt-dep8-1p-5d" + - "b200-fp8-mtp-max-tpt-dep8-2p-5d" + resources: + prefill_nodes: [1, 1, 2] + prefill_workers: [1, 1, 2] + decode_nodes: [1, 5, 5] + decode_workers: [1, 5, 5] + backend: + prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + prefill: + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + decode: + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + benchmark: + concurrencies: ["512x1024x2048x4096", "512x4096", "1024x2048x4096"] + + +# MTP special case: 1p2d uses speculative-num-steps=1 and draft-tokens=2 (vs 2/3 for all others) +override_mtp_maxtpt_1p2d: + name: "b200-fp8-mtp-max-tpt-dep8-1p-2d" + resources: + decode_nodes: 2 + decode_workers: 2 + backend: + prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + prefill: + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + decode: + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + speculative-algorithm: "EAGLE" + speculative-num-steps: 1 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 
2 + benchmark: + concurrencies: "512x1024x2048" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_0.yaml new file mode 100644 index 000000000..36b78e975 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_0.yaml @@ -0,0 +1,148 @@ +name: b200-fp8-mtp-low-latency-tep8-1p-3d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 3 + decode_workers: 3 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' +
SGLANG_ENABLE_SPEC_V2: '1' + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 32 + cuda-graph-max-bs: 32 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 1 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + speculative-algorithm: EAGLE + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 +health_check: + max_attempts: 720 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_1.yaml new file mode 100644 index 000000000..0fed3f9a6 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_1.yaml @@ -0,0 +1,148 @@ +name: b200-fp8-mtp-low-latency-tep8-1p-4d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 4 + decode_workers: 4 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + 
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 32 + cuda-graph-max-bs: 32 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 1 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + speculative-algorithm: EAGLE + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 +health_check: + max_attempts: 720 + 
interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_2.yaml new file mode 100644 index 000000000..e39611a4b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_2.yaml @@ -0,0 +1,148 @@ +name: b200-fp8-mtp-low-latency-tep8-1p-6d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 6 + decode_workers: 6 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + 
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 22 + cuda-graph-max-bs: 22 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 1 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + 
disable-radix-cache: true + speculative-algorithm: EAGLE + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 +health_check: + max_attempts: 720 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_0.yaml new file mode 100644 index 000000000..78dc57d5a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_0.yaml @@ -0,0 +1,151 @@ +name: b200-fp8-mtp-max-tpt-dep8-1p-2d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 2 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + 
PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 128 + cuda-graph-max-bs: 16 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 8 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # 
MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + speculative-algorithm: EAGLE + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 +health_check: + max_attempts: 720 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_1.yaml new file mode 100644 index 000000000..202a10631 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_1.yaml @@ -0,0 +1,151 @@ +name: b200-fp8-mtp-max-tpt-dep8-1p-1d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + 
NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + 
context-length: 9600 + max-running-requests: 256 + cuda-graph-max-bs: 32 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 8 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + speculative-algorithm: EAGLE + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 +health_check: + max_attempts: 720 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_2.yaml new file mode 100644 index 000000000..e2a619e29 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_2.yaml @@ -0,0 +1,151 @@ +name: b200-fp8-mtp-max-tpt-dep8-2p-1d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 2 + prefill_workers: 2 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + 
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model 
configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 512 + cuda-graph-max-bs: 64 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 8 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + speculative-algorithm: EAGLE + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 +health_check: + max_attempts: 720 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_3.yaml new file mode 100644 index 000000000..5e959ca38 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_3.yaml @@ -0,0 +1,151 @@ +name: b200-fp8-mtp-max-tpt-dep8-3p-1d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 3 + prefill_workers: 3 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + 
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 1024 + cuda-graph-max-bs: 128 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 8 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + speculative-algorithm: EAGLE + speculative-num-steps: 2 + speculative-eagle-topk: 1 
+ speculative-num-draft-tokens: 3 +health_check: + max_attempts: 720 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_0.yaml new file mode 100644 index 000000000..24d37e3ee --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_0.yaml @@ -0,0 +1,146 @@ +name: b200-fp8-stp-low-latency-tp8-1p-3d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 3 + decode_workers: 3 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + 
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 32 + cuda-graph-max-bs: 32 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 1 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + 
enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + # disable-chunked-prefix-cache: true + +health_check: + max_attempts: 360 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_1.yaml new file mode 100644 index 000000000..c97d109d9 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_1.yaml @@ -0,0 +1,146 @@ +name: b200-fp8-stp-low-latency-tp8-1p-4d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 4 + decode_workers: 4 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' 
+ SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 32 + cuda-graph-max-bs: 32 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 1 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + 
watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + # disable-chunked-prefix-cache: true + +health_check: + max_attempts: 360 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_2.yaml new file mode 100644 index 000000000..503f1363b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_2.yaml @@ -0,0 +1,146 @@ +name: b200-fp8-stp-low-latency-tp8-1p-6d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 6 + decode_workers: 6 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + 
DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 22 + cuda-graph-max-bs: 22 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 1 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # 
Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + # disable-chunked-prefix-cache: true + +health_check: + max_attempts: 360 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_0.yaml new file mode 100644 index 000000000..cb8d13717 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_0.yaml @@ -0,0 +1,147 @@ +name: b200-fp8-stp-max-tpt-dep8-1p-2d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 2 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' 
+ PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 128 + cuda-graph-max-bs: 128 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 8 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + 
moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 +health_check: + max_attempts: 360 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_1.yaml new file mode 100644 index 000000000..875893e72 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_1.yaml @@ -0,0 +1,147 @@ +name: b200-fp8-stp-max-tpt-dep8-1p-1d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + 
SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 256 + cuda-graph-max-bs: 256 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + 
expert-parallel-size: 8 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 +health_check: + max_attempts: 360 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_2.yaml new file mode 100644 index 000000000..1402c1202 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_2.yaml @@ -0,0 +1,147 @@ +name: b200-fp8-stp-max-tpt-dep8-2p-1d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 2 + prefill_workers: 2 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + 
NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 512 + cuda-graph-max-bs: 512 + + # 
Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 8 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 +health_check: + max_attempts: 360 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_3.yaml new file mode 100644 index 000000000..a689bf0ac --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_3.yaml @@ -0,0 +1,147 @@ +name: b200-fp8-stp-max-tpt-dep8-3p-1d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 3 + prefill_workers: 3 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + 
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + 
mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 1024 + cuda-graph-max-bs: 1024 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 8 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 +health_check: + max_attempts: 360 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/low-latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/low-latency.yaml new file mode 100644 index 000000000..b280e7176 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/low-latency.yaml @@ -0,0 +1,128 @@ +name: "gb200-fp4-1k1k-low-latency" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 3 + nginx_container: nginx-sqsh + +model: + path: "dsr1-fp4" + container: "dynamo-sglang" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + decode_nodes: 2 + prefill_workers: 1 + decode_workers: 2 + gpus_per_node: 4 + +backend: + + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + 
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + + sglang_config: + prefill: + disaggregation-mode: "prefill" + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + disable-radix-cache: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_trtllm" + stream-interval: 10 + watchdog-timeout: 1000000 + context-length: 2200 + mem-fraction-static: 0.95 + max-total-tokens: 8192 + chunked-prefill-size: 8192 + cuda-graph-max-bs: 256 + max-running-requests: 512 + scheduler-recv-interval: 10 + enable-symm-mem: true + load-balance-method: "round_robin" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + fp4-gemm-backend: "flashinfer_trtllm" + data-parallel-size: 1 + tensor-parallel-size: 4 + expert-parallel-size: 1 + + decode: + disaggregation-mode: "decode" + served-model-name: "deepseek-ai/DeepSeek-R1" + prefill-round-robin-balance: true + trust-remote-code: true + disable-radix-cache: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: 
"trtllm_mla" + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_trtllm" + disaggregation-bootstrap-port: 30001 + stream-interval: 10 + watchdog-timeout: 1000000 + context-length: 2200 + mem-fraction-static: 0.95 + chunked-prefill-size: 8192 + cuda-graph-max-bs: 256 + scheduler-recv-interval: 10 + enable-symm-mem: true + disaggregation-transfer-backend: nixl + fp4-gemm-backend: "flashinfer_trtllm" + tensor-parallel-size: 4 + expert-parallel-size: 1 + +# InferenceX bench-serving wrapper, invoked via srt-slurm `benchmark.type: custom`. +# Most env (MODEL, ISL, OSL, CONC_LIST, DISAGG) is exported by +# benchmark-multinode-tmpl.yml and propagated through srtctl → srun → pyxis, +# so the recipe only carries per-recipe knobs that have no workflow source. +# See benchmarks/multi_node/srt_bench.sh for the full env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + # Override $MODEL because this sglang recipe advertises a different + # served-model-name from what master-yaml's `model:` field is set to. 
+ MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "12" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/max-tpt.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/max-tpt.yaml new file mode 100644 index 000000000..eb499618e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/max-tpt.yaml @@ -0,0 +1,190 @@ +name: "gb200-fp4-1k1k-max-tpt" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 9 + nginx_container: nginx-sqsh + +model: + path: "dsr1-fp4" + container: "dynamo-sglang" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 4 + decode_nodes: 12 + prefill_workers: 4 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + 
NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024" + SGLANG_MOE_NVFP4_DISPATCH: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_cutlass" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 2176 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.84 + max-total-tokens: 131072 + max-prefill-tokens: 32768 + chunked-prefill-size: 65536 + enable-single-batch-overlap: true + + # Request handling + max-running-requests: 30000 + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + enable-dp-attention: true + fp4-gemm-backend: "flashinfer_cutlass" + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 4 + dp-size: 4 + ep-size: 4 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_cutedsl" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 2176 + disable-shared-experts-fusion: true 
+ eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.83 + max-total-tokens: 3122380 + chunked-prefill-size: 786432 + + # Request handling + max-running-requests: 67584 + enable-single-batch-overlap: true + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + ep-num-redundant-experts: 32 + + # CUDA graphs (extensive batch size list) + cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 1024] + num-reserved-decode-tokens: 112 + + # Additional decode optimizations + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + enable-dp-attention: true + disaggregation-transfer-backend: nixl + fp4-gemm-backend: "flashinfer_cutlass" + + # Parallelism + tp-size: 48 + dp-size: 48 + ep-size: 48 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "48"
+    TOTAL_GPUS: "64"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/mid-curve.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/mid-curve.yaml
new file mode 100644
index 000000000..fdfce3821
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/mid-curve.yaml
@@ -0,0 +1,189 @@
+name: "gb200-fp4-1k1k-mid-curve"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 9
+  nginx_container: nginx-sqsh
+
+model:
+  path: "dsr1-fp4"
+  container: "dynamo-sglang"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 4
+  decode_nodes: 8
+  prefill_workers: 4
+  decode_workers: 1
+  gpus_per_node: 4
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
+    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+
+  # Decode-specific environment variables
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
+    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024"
+    SGLANG_MOE_NVFP4_DISPATCH: "1"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+
+      # KV cache and attention
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+
+      # Quantization
+      quantization: "modelopt_fp4"
+      moe-runner-backend: "flashinfer_cutlass"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+      disable-chunked-prefix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      watchdog-timeout: 1000000
+      context-length: 2176
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Prefill-specific mode
+      disaggregation-mode: "prefill"
+
+      # Memory and token limits
+      mem-fraction-static: 0.84
+      max-total-tokens: 131072
+      max-prefill-tokens: 32768
+      chunked-prefill-size: 65536
+      enable-single-batch-overlap: true
+
+      # Request handling
+      max-running-requests: 30000
+      load-balance-method: "round_robin"
+
+      # Performance optimizations
+      disable-cuda-graph: true
+      enable-dp-attention: true
+      fp4-gemm-backend: "flashinfer_cutlass"
+      disaggregation-transfer-backend: nixl
+
+      # Parallelism
+      tp-size: 4
+      dp-size: 4
+      ep-size: 4
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+
+      # KV cache and attention
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+
+      # Quantization
+      quantization: "modelopt_fp4"
+      moe-runner-backend: "flashinfer_cutedsl"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+      disable-chunked-prefix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      watchdog-timeout: 1000000
+      context-length: 2176
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Decode-specific mode
+      disaggregation-mode: "decode"
+
+      # Memory and token limits
+      mem-fraction-static: 0.83
+      max-total-tokens: 3122380
+      chunked-prefill-size: 786432
+
+      # Request handling
+      max-running-requests: 67584
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "low_latency"
+      ep-dispatch-algorithm: "static"
+      ep-num-redundant-experts: 32
+
+      # CUDA graphs (extensive batch size list)
+      cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 1024]
+      num-reserved-decode-tokens: 112
+
+      # Additional decode optimizations
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      prefill-round-robin-balance: true
+      enable-dp-attention: true
+      disaggregation-transfer-backend: nixl
+      fp4-gemm-backend: "flashinfer_cutlass"
+
+      # Parallelism
+      tp-size: 32
+      dp-size: 32
+      ep-size: 32
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "32"
+    TOTAL_GPUS: "48"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/low-latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/low-latency.yaml
new file mode 100644
index 000000000..48b044bd3
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/low-latency.yaml
@@ -0,0 +1,125 @@
+name: "gb200-fp4-8k1k-low-latency"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 4
+  nginx_container: nginx-sqsh
+
+model:
+  path: "dsr1-fp4"
+  container: "dynamo-sglang"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 1
+  decode_nodes: 4
+  prefill_workers: 1
+  decode_workers: 4
+  gpus_per_node: 4
+
+backend:
+
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
+
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
+
+  sglang_config:
+    prefill:
+      disaggregation-mode: "prefill"
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+      disable-radix-cache: true
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+      quantization: "modelopt_fp4"
+      moe-runner-backend: "flashinfer_trtllm"
+      stream-interval: 50
+      watchdog-timeout: 1000000
+      context-length: 9600
+      mem-fraction-static: 0.95
+      max-total-tokens: 32768
+      chunked-prefill-size: 24576
+      cuda-graph-max-bs: 256
+      max-running-requests: 512
+      scheduler-recv-interval: 10
+      enable-symm-mem: true
+      load-balance-method: "round_robin"
+      disaggregation-bootstrap-port: 30001
+      data-parallel-size: 1
+      disaggregation-transfer-backend: nixl
+      fp4-gemm-backend: "flashinfer_trtllm"
+      tensor-parallel-size: 4
+      expert-parallel-size: 1
+      enable-dp-attention: false
+
+    decode:
+      disaggregation-mode: "decode"
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      prefill-round-robin-balance: true
+      trust-remote-code: true
+      disable-radix-cache: true
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+      quantization: "modelopt_fp4"
+      moe-runner-backend: "flashinfer_trtllm"
+      disaggregation-bootstrap-port: 30001
+      stream-interval: 50
+      watchdog-timeout: 1000000
+      context-length: 9600
+      mem-fraction-static: 0.95
+      chunked-prefill-size: 8192
+      cuda-graph-max-bs: 256
+      scheduler-recv-interval: 10
+      enable-symm-mem: true
+      disaggregation-transfer-backend: nixl
+      fp4-gemm-backend: "flashinfer_trtllm"
+      tensor-parallel-size: 4
+      expert-parallel-size: 1
+      enable-dp-attention: false
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "4"
+    TOTAL_GPUS: "20"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/max-tpt.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/max-tpt.yaml
new file mode 100644
index 000000000..cbf43343b
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/max-tpt.yaml
@@ -0,0 +1,186 @@
+name: "gb200-fp4-8k1k-max-tpt"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 9
+  nginx_container: nginx-sqsh
+
+model:
+  path: "dsr1-fp4"
+  container: "dynamo-sglang"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 10
+  decode_nodes: 8
+  prefill_workers: 10
+  decode_workers: 1
+  gpus_per_node: 4
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
+    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+
+  # Decode-specific environment variables
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
+    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "512"
+    SGLANG_MOE_NVFP4_DISPATCH: "1"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+
+      # KV cache and attention
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+
+      # Quantization
+      quantization: "modelopt_fp4"
+      moe-runner-backend: "flashinfer_trtllm"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+      disable-chunked-prefix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      watchdog-timeout: 1000000
+      context-length: 9600
+      disable-shared-experts-fusion: true
+      disaggregation-bootstrap-port: 30001
+
+      # Prefill-specific mode
+      disaggregation-mode: "prefill"
+
+      # Memory and token limits
+      mem-fraction-static: 0.95
+      max-total-tokens: 131072
+      max-prefill-tokens: 524288
+      chunked-prefill-size: 131072
+
+      # Request handling
+      max-running-requests: 30000
+      load-balance-method: "round_robin"
+
+      # Performance optimizations
+      disable-cuda-graph: true
+      enable-dp-attention: false
+      fp4-gemm-backend: "flashinfer_cutlass"
+      disaggregation-transfer-backend: nixl
+
+      # Parallelism
+      tp-size: 4
+      dp-size: 1
+      ep-size: 1
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+
+      # KV cache and attention
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+
+      # Quantization
+      quantization: "modelopt_fp4"
+      moe-runner-backend: "flashinfer_cutedsl"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+      disable-chunked-prefix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      watchdog-timeout: 1000000
+      context-length: 9600
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Decode-specific mode
+      disaggregation-mode: "decode"
+
+      # Memory and token limits
+      mem-fraction-static: 0.83
+      max-total-tokens: 524288
+      chunked-prefill-size: 24576
+
+      # Request handling
+      max-running-requests: 16384
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "low_latency"
+      ep-dispatch-algorithm: "static"
+      ep-num-redundant-experts: 32
+
+      cuda-graph-max-bs: 512
+      num-reserved-decode-tokens: 112
+
+      # Additional decode optimizations
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      prefill-round-robin-balance: true
+      enable-dp-attention: true
+      fp4-gemm-backend: "flashinfer_cutlass"
+      disaggregation-transfer-backend: nixl
+
+      # Parallelism
+      tp-size: 32
+      dp-size: 32
+      ep-size: 32
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "32"
+    TOTAL_GPUS: "72"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/mid-curve.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/mid-curve.yaml
new file mode 100644
index 000000000..39f9ab7c8
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/mid-curve.yaml
@@ -0,0 +1,186 @@
+name: "gb200-fp4-8k1k-mid-curve"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 9
+  nginx_container: nginx-sqsh
+
+model:
+  path: "dsr1-fp4"
+  container: "dynamo-sglang"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 6
+  decode_nodes: 12
+  prefill_workers: 6
+  decode_workers: 1
+  gpus_per_node: 4
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
+    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+
+  # Decode-specific environment variables
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
+    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "512"
+    SGLANG_MOE_NVFP4_DISPATCH: "1"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+
+      # KV cache and attention
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+
+      # Quantization
+      quantization: "modelopt_fp4"
+      moe-runner-backend: "flashinfer_trtllm"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+      disable-chunked-prefix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      watchdog-timeout: 1000000
+      context-length: 9600
+      disable-shared-experts-fusion: true
+      disaggregation-bootstrap-port: 30001
+
+      # Prefill-specific mode
+      disaggregation-mode: "prefill"
+
+      # Memory and token limits
+      mem-fraction-static: 0.95
+      max-total-tokens: 131072
+      max-prefill-tokens: 524288
+      chunked-prefill-size: 131072
+
+      # Request handling
+      max-running-requests: 30000
+      load-balance-method: "round_robin"
+
+      # Performance optimizations
+      disable-cuda-graph: true
+      enable-dp-attention: false
+      fp4-gemm-backend: "flashinfer_cutlass"
+      disaggregation-transfer-backend: nixl
+
+      # Parallelism
+      tp-size: 4
+      dp-size: 1
+      ep-size: 1
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+
+      # KV cache and attention
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+
+      # Quantization
+      quantization: "modelopt_fp4"
+      moe-runner-backend: "flashinfer_cutedsl"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+      disable-chunked-prefix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      watchdog-timeout: 1000000
+      context-length: 9600
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Decode-specific mode
+      disaggregation-mode: "decode"
+
+      # Memory and token limits
+      mem-fraction-static: 0.83
+      max-total-tokens: 524288
+      chunked-prefill-size: 24576
+
+      # Request handling
+      max-running-requests: 16384
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "low_latency"
+      ep-dispatch-algorithm: "static"
+      ep-num-redundant-experts: 32
+
+      cuda-graph-max-bs: 512
+      num-reserved-decode-tokens: 112
+
+      # Additional decode optimizations
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      prefill-round-robin-balance: true
+      enable-dp-attention: true
+      fp4-gemm-backend: "flashinfer_cutlass"
+      disaggregation-transfer-backend: nixl
+
+      # Parallelism
+      tp-size: 48
+      dp-size: 48
+      ep-size: 48
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "48"
+    TOTAL_GPUS: "72"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/low-latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/low-latency.yaml
new file mode 100644
index 000000000..5dc0c0c73
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/low-latency.yaml
@@ -0,0 +1,128 @@
+name: "gb200-fp8-1k1k-low-latency"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 2
+  nginx_container: nginx
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-sglang"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 1
+  decode_nodes: 1
+  prefill_workers: 1
+  decode_workers: 1
+  gpus_per_node: 4
+
+backend:
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
+    SGLANG_ENABLE_FLASHINFER_GEMM: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+
+  sglang_config:
+    prefill:
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+      quantization: "fp8"
+      moe-runner-backend: "flashinfer_trtllm"
+      disable-radix-cache: true
+      stream-interval: 10
+      watchdog-timeout: 1000000
+      context-length: 2200
+      disaggregation-mode: "prefill"
+      mem-fraction-static: 0.95
+      max-total-tokens: 8192
+      chunked-prefill-size: 8192
+      cuda-graph-max-bs: 128
+      max-running-requests: 512
+      load-balance-method: "round_robin"
+      scheduler-recv-interval: 10
+      fp8-gemm-backend: "flashinfer_trtllm"
+      enable-symm-mem: true
+      tensor-parallel-size: 4
+      data-parallel-size: 1
+      expert-parallel-size: 1
+      disaggregation-bootstrap-port: 30001
+      disaggregation-transfer-backend: nixl
+
+    decode:
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+      quantization: "fp8"
+      moe-runner-backend: "flashinfer_trtllm"
+      disable-radix-cache: true
+      stream-interval: 10
+      watchdog-timeout: 1000000
+      context-length: 2200
+      disaggregation-mode: "decode"
+      mem-fraction-static: 0.95
+      chunked-prefill-size: 8192
+      cuda-graph-max-bs: 128
+      max-running-requests: 128
+      scheduler-recv-interval: 10
+      enable-symm-mem: true
+      prefill-round-robin-balance: true
+      tensor-parallel-size: 4
+      data-parallel-size: 1
+      expert-parallel-size: 1
+      fp8-gemm-backend: "flashinfer_trtllm"
+      disaggregation-bootstrap-port: 30001
+      disaggregation-transfer-backend: nixl
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "4"
+    TOTAL_GPUS: "8"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/max-tpt.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/max-tpt.yaml
new file mode 100644
index 000000000..c7a9e0923
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/max-tpt.yaml
@@ -0,0 +1,182 @@
+name: "gb200-fp8-1k1k-max-tpt"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 9
+  nginx_container: nginx
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-sglang"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 4
+  prefill_workers: 2
+  decode_nodes: 8
+  decode_workers: 1
+  gpus_per_node: 4
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    MC_TE_METRIC: "true"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    PYTHONUNBUFFERED: "1"
+
+  # Decode-specific environment variables
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "768"
+    MC_TE_METRIC: "true"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    PYTHONUNBUFFERED: "1"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "trtllm_mla"
+      kv-cache-dtype: "fp8_e4m3"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      max-running-requests: 30000
+      context-length: 2200
+      watchdog-timeout: 1000000
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Prefill-specific mode
+      disaggregation-mode: "prefill"
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-total-tokens: 524288
+      chunked-prefill-size: 131072
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+      # Performance optimizations
+      disable-cuda-graph: true
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "normal"
+      ep-dispatch-algorithm: "dynamic"
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      ep-num-redundant-experts: 32
+      deepep-config: "/configs/deepep_config.json"
+
+      disaggregation-transfer-backend: nixl
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 32
+      dp-size: 32
+      ep-size: 32
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "trtllm_mla"
+      kv-cache-dtype: "fp8_e4m3"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      max-running-requests: 45000
+      context-length: 2200
+      watchdog-timeout: 1000000
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Decode-specific mode
+      disaggregation-mode: "decode"
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      chunked-prefill-size: 36864
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "low_latency"
+      ep-dispatch-algorithm: "static"
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      prefill-round-robin-balance: true
+      ep-num-redundant-experts: 32
+      deepep-config: "/configs/deepep_config.json"
+
+      # CUDA graphs
+      cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768]
+      cuda-graph-max-bs: 768
+
+      disaggregation-transfer-backend: nixl
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "32"
+    TOTAL_GPUS: "48"
+
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/mid-curve.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/mid-curve.yaml
new file mode 100644
index 000000000..0de49d6d7
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/mid-curve.yaml
@@ -0,0 +1,181 @@
+name: "gb200-fp8-1k1k-mid-curve"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 9
+  nginx_container: nginx
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-sglang"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 6
+  prefill_workers: 3
+  decode_nodes: 12
+  decode_workers: 1
+  gpus_per_node: 4
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    MC_TE_METRIC: "true"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    PYTHONUNBUFFERED: "1"
+
+  # Decode-specific environment variables
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "768"
+    MC_TE_METRIC: "true"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    PYTHONUNBUFFERED: "1"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "trtllm_mla"
+      kv-cache-dtype: "fp8_e4m3"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      max-running-requests: 30000
+      context-length: 2200
+      watchdog-timeout: 1000000
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Prefill-specific mode
+      disaggregation-mode: "prefill"
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-total-tokens: 524288
+      chunked-prefill-size: 131072
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+      # Performance optimizations
+      disable-cuda-graph: true
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "normal"
+      ep-dispatch-algorithm: "dynamic"
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      ep-num-redundant-experts: 32
+      deepep-config: "/configs/deepep_config.json"
+      disaggregation-transfer-backend: nixl
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 48
+      dp-size: 48
+      ep-size: 48
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "trtllm_mla"
+      kv-cache-dtype: "fp8_e4m3"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      max-running-requests: 45000
+      context-length: 2200
+      watchdog-timeout: 1000000
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Decode-specific mode
+      disaggregation-mode: "decode"
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      chunked-prefill-size: 36864
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "low_latency"
+      ep-dispatch-algorithm: "static"
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      prefill-round-robin-balance: true
+      ep-num-redundant-experts: 32
+      deepep-config: "/configs/deepep_config.json"
+
+      # CUDA graphs
+      cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768]
+      cuda-graph-max-bs: 768
+      disaggregation-transfer-backend: nixl
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "48"
+    TOTAL_GPUS: "72"
+
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/ultra-tpt.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/ultra-tpt.yaml
new file mode 100644
index 000000000..f335aa042
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/ultra-tpt.yaml
@@ -0,0 +1,183 @@
+name: "gb200-fp8-1k1k-ultra-tpt"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 3
+  nginx_container: nginx
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-sglang"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 2
+  prefill_workers: 1
+  decode_nodes: 2
+  decode_workers: 1
+  gpus_per_node: 4
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    MC_TE_METRIC: "true"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    PYTHONUNBUFFERED: "1"
+
+  # Decode-specific environment variables
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "640"
+    MC_TE_METRIC: "true"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
+    SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    PYTHONUNBUFFERED: "1"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "trtllm_mla"
+      kv-cache-dtype: "fp8_e4m3"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      max-running-requests: 8192
+      context-length: 2200
+      watchdog-timeout: 1000000
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Prefill-specific mode
+      disaggregation-mode: "prefill"
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-total-tokens: 524288
+      chunked-prefill-size: 131072
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+      # Performance optimizations
+      disable-cuda-graph: true
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "normal"
+      ep-dispatch-algorithm: "dynamic"
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      ep-num-redundant-experts: 32
+      deepep-config: "/configs/deepep_config.json"
+
+      disaggregation-transfer-backend: nixl
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "trtllm_mla"
+      kv-cache-dtype: "fp8_e4m3"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      max-running-requests: 5120
+      context-length: 2200
+      watchdog-timeout: 1000000
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Decode-specific mode
+      disaggregation-mode: "decode"
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      chunked-prefill-size: 36864
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "low_latency"
+      ep-dispatch-algorithm: "static"
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      prefill-round-robin-balance: true
+      ep-num-redundant-experts: 32
+      deepep-config: "/configs/deepep_config.json"
+
+      # CUDA graphs
+      cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640]
+      cuda-graph-max-bs: 640
+
+      disaggregation-transfer-backend: nixl
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "16"
+
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/low-latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/low-latency.yaml
new file mode 100644
index 000000000..94ee5ed1f
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/low-latency.yaml
@@ -0,0 +1,124 @@
+name: "gb200-fp8-8k1k-low-latency"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 2
+  nginx_container: nginx
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-sglang"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 2
+  decode_nodes: 2
+  prefill_workers: 1
+  decode_workers: 1
+  gpus_per_node: 4
+
+backend:
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + + sglang_config: + prefill: + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "fp8" + moe-runner-backend: "flashinfer_trtllm" + disable-radix-cache: true + watchdog-timeout: 1000000 + context-length: 9600 + disaggregation-mode: "prefill" + mem-fraction-static: 0.8 + max-total-tokens: 32768 + chunked-prefill-size: 24576 + cuda-graph-max-bs: 512 + max-running-requests: 512 + load-balance-method: "round_robin" + scheduler-recv-interval: 10 + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 1 + fp8-gemm-backend: "flashinfer_trtllm" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + + decode: + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "fp8" + moe-runner-backend: "flashinfer_trtllm" + disable-radix-cache: true + watchdog-timeout: 1000000 + context-length: 9600 + disaggregation-mode: "decode" + mem-fraction-static: 0.8 + chunked-prefill-size: 8192 + cuda-graph-max-bs: 512 + max-running-requests: 512 + scheduler-recv-interval: 10 + enable-symm-mem: true + prefill-round-robin-balance: true + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 1 + fp8-gemm-backend: "flashinfer_trtllm" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/max_tpt.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/max_tpt.yaml new file mode 100644 index 000000000..2865f2e52 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/max_tpt.yaml @@ -0,0 +1,178 @@ +name: "gb200-8k1k-fp8-max-tpt" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 9 + nginx_container: nginx + +model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 12 + prefill_workers: 6 + decode_nodes: 6 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "512" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + 
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 8 + ep-size: 8 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + max-running-requests: 30000 + context-length: 9300 + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.80 + max-total-tokens: 524288 + chunked-prefill-size: 131072 + + # Request handling + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "normal" + ep-dispatch-algorithm: "dynamic" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 24 + dp-size: 24 + ep-size: 24 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 
1000 + max-running-requests: 8192 + context-length: 9300 + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.82 + chunked-prefill-size: 36864 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + # CUDA graphs + cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512] + cuda-graph-max-bs: 512 + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "24" + TOTAL_GPUS: "72" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/mid-curve.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/mid-curve.yaml new file mode 100644 index 000000000..a1559e71d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/mid-curve.yaml @@ -0,0 +1,177 @@ +name: "gb200-8k1k-fp8-mid-tpt" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 9 + nginx_container: nginx + +model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 10 + prefill_workers: 5 + decode_nodes: 8 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "256" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + 
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 8 + ep-size: 8 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + max-running-requests: 30000 + context-length: 9300 + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.80 + max-total-tokens: 524288 + chunked-prefill-size: 131072 + + # Request handling + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "normal" + ep-dispatch-algorithm: "dynamic" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 32 + dp-size: 32 + ep-size: 32 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 
1000 + max-running-requests: 8192 + context-length: 9300 + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.82 + chunked-prefill-size: 36864 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + # CUDA graphs + cuda-graph-max-bs: 256 + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "32" + TOTAL_GPUS: "72" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/low_latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/low_latency.yaml new file mode 100644 index 000000000..c531f8446 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/low_latency.yaml @@ -0,0 +1,123 @@ +name: "gb300-fp4-low-latency-1k1k" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 4 + nginx_container: nginx-sqsh + +model: + path: "dsr1" + container: "dynamo-sglang" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + decode_nodes: 2 + prefill_workers: 1 + decode_workers: 2 + gpus_per_node: 4 + +backend: + + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + 
PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + + sglang_config: + prefill: + disaggregation-mode: "prefill" + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + disable-radix-cache: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_trtllm" + stream-interval: 10 + watchdog-timeout: 1000000 + context-length: 2200 + mem-fraction-static: 0.95 + max-total-tokens: 8192 + chunked-prefill-size: 8192 + cuda-graph-max-bs: 256 + max-running-requests: 512 + scheduler-recv-interval: 10 + enable-symm-mem: true + load-balance-method: "round_robin" + disaggregation-bootstrap-port: 30001 + data-parallel-size: 1 + tensor-parallel-size: 4 + expert-parallel-size: 1 + fp4-gemm-backend: "flashinfer_trtllm" + disaggregation-transfer-backend: nixl + + decode: + disaggregation-mode: "decode" + served-model-name: "deepseek-ai/DeepSeek-R1" + prefill-round-robin-balance: true 
+ trust-remote-code: true + disable-radix-cache: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_trtllm" + disaggregation-bootstrap-port: 30001 + stream-interval: 10 + watchdog-timeout: 1000000 + context-length: 2200 + mem-fraction-static: 0.95 + chunked-prefill-size: 8192 + cuda-graph-max-bs: 256 + scheduler-recv-interval: 10 + enable-symm-mem: true + tensor-parallel-size: 4 + expert-parallel-size: 1 + fp4-gemm-backend: "flashinfer_trtllm" + disaggregation-transfer-backend: nixl + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "12" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/max_tpt.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/max_tpt.yaml new file mode 100644 index 000000000..c4a3d6524 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/max_tpt.yaml @@ -0,0 +1,191 @@ +name: "gb300-fp4-max-tpt-1k1k" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 9 + nginx_container: nginx-sqsh + +model: + path: "dsr1" + container: "dynamo-sglang" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 4 + decode_nodes: 12 + prefill_workers: 4 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + 
SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024" + SGLANG_MOE_NVFP4_DISPATCH: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_cutlass" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 2176 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.84 + max-total-tokens: 131072 + max-prefill-tokens: 32768 + chunked-prefill-size: 65536 + 
enable-single-batch-overlap: true + + # Request handling + max-running-requests: 30000 + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + enable-dp-attention: true + disaggregation-transfer-backend: nixl + fp4-gemm-backend: "flashinfer_cutlass" + + # Parallelism + tp-size: 4 + dp-size: 4 + ep-size: 4 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_cutedsl" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 2176 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.83 + max-total-tokens: 3122380 + chunked-prefill-size: 786432 + + # Request handling + max-running-requests: 67584 + enable-single-batch-overlap: true + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + ep-num-redundant-experts: 32 + + # CUDA graphs (extensive batch size list) + cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 1024] + num-reserved-decode-tokens: 112 + + # Additional decode optimizations + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + enable-dp-attention: true + fp4-gemm-backend: "flashinfer_cutlass" + 
disaggregation-transfer-backend: nixl + + + # Parallelism + tp-size: 48 + dp-size: 48 + ep-size: 48 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "48" + TOTAL_GPUS: "64" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/mid_curve.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/mid_curve.yaml new file mode 100644 index 000000000..e6d388906 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/mid_curve.yaml @@ -0,0 +1,189 @@ +name: "gb300-fp4-mid-curve-1k1k" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 9 + nginx_container: nginx-sqsh + +model: + path: "dsr1" + container: "dynamo-sglang" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 4 + decode_nodes: 8 + prefill_workers: 4 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + + # Decode-specific environment variables + 
decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024" + SGLANG_MOE_NVFP4_DISPATCH: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_cutlass" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 2176 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Prefill-specific mode + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.84 + max-total-tokens: 131072 + max-prefill-tokens: 32768 + chunked-prefill-size: 65536 + enable-single-batch-overlap: true + + # Request handling + max-running-requests: 30000 + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + enable-dp-attention: true + fp4-gemm-backend: "flashinfer_cutlass" + + # Parallelism + tp-size: 4 + dp-size: 4 + ep-size: 4 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache 
and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_cutedsl" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 2176 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.83 + max-total-tokens: 3122380 + chunked-prefill-size: 786432 + + # Request handling + max-running-requests: 67584 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + ep-num-redundant-experts: 32 + + # CUDA graphs (extensive batch size list) + cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 1024] + num-reserved-decode-tokens: 112 + + # Additional decode optimizations + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + enable-dp-attention: true + fp4-gemm-backend: "flashinfer_cutlass" + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 32 + dp-size: 32 + ep-size: 32 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "48" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/low_latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/low_latency.yaml new file mode 100644 index 000000000..5c95e1ffa --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/low_latency.yaml @@ -0,0 +1,126 @@ +name: "gb300-8k1k-fp4-low-latency-8k1k" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 3 + nginx_container: nginx-sqsh + +model: + path: "dsr1" + container: "dynamo-sglang" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + decode_nodes: 4 + prefill_workers: 1 + decode_workers: 4 + gpus_per_node: 4 + +backend: + + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + 
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + + sglang_config: + prefill: + disaggregation-mode: "prefill" + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + disable-radix-cache: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_trtllm" + stream-interval: 50 + watchdog-timeout: 1000000 + context-length: 9600 + mem-fraction-static: 0.95 + max-total-tokens: 32768 + chunked-prefill-size: 24576 + cuda-graph-max-bs: 256 + max-running-requests: 512 + scheduler-recv-interval: 10 + enable-symm-mem: true + load-balance-method: "round_robin" + disaggregation-bootstrap-port: 30001 + data-parallel-size: 1 + tensor-parallel-size: 4 + expert-parallel-size: 1 + enable-dp-attention: false + fp4-gemm-backend: "flashinfer_trtllm" + disaggregation-transfer-backend: nixl + + + decode: + disaggregation-mode: "decode" + served-model-name: "deepseek-ai/DeepSeek-R1" + prefill-round-robin-balance: true + trust-remote-code: true + disable-radix-cache: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_trtllm" + disaggregation-bootstrap-port: 30001 + stream-interval: 50 + watchdog-timeout: 1000000 + context-length: 9600 + mem-fraction-static: 0.95 + chunked-prefill-size: 8192 + cuda-graph-max-bs: 128 + scheduler-recv-interval: 10 + enable-symm-mem: true + tensor-parallel-size: 4 + expert-parallel-size: 1 + enable-dp-attention: false + fp4-gemm-backend: "flashinfer_trtllm" + disaggregation-transfer-backend: nixl + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "20" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/max_tpt.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/max_tpt.yaml new file mode 100644 index 000000000..29a619a6f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/max_tpt.yaml @@ -0,0 +1,186 @@ +name: "gb300-fp4-8k1k-max-tpt" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 9 + nginx_container: nginx-sqsh + +model: + path: "dsr1" + container: "dynamo-sglang" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 10 + decode_nodes: 8 + prefill_workers: 10 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: 
"100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "512" + SGLANG_MOE_NVFP4_DISPATCH: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_trtllm" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 9600 + disable-shared-experts-fusion: true + disaggregation-bootstrap-port: 30001 + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.95 + max-total-tokens: 131072 + max-prefill-tokens: 524288 + chunked-prefill-size: 131072 + + # Request handling + max-running-requests: 30000 + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + enable-dp-attention: false + fp4-gemm-backend: "flashinfer_cutlass" + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 4 + dp-size: 1 + ep-size: 1 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_cutedsl" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 
1000 + watchdog-timeout: 1000000 + context-length: 9600 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.83 + max-total-tokens: 524288 + chunked-prefill-size: 24576 + + # Request handling + max-running-requests: 16384 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + ep-num-redundant-experts: 32 + + cuda-graph-max-bs: 512 + num-reserved-decode-tokens: 112 + + # Additional decode optimizations + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + enable-dp-attention: true + fp4-gemm-backend: "flashinfer_cutlass" + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 32 + dp-size: 32 + ep-size: 32 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "72" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/mid_curve.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/mid_curve.yaml new file mode 100644 index 000000000..b4de76bb9 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/mid_curve.yaml @@ -0,0 +1,186 @@ +name: "gb300-fp4-8k1k-mid-curve" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 9 + nginx_container: nginx-sqsh + +model: + path: "dsr1" + container: "dynamo-sglang" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 6 + decode_nodes: 12 + prefill_workers: 6 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + 
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "512" + SGLANG_MOE_NVFP4_DISPATCH: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_trtllm" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 9600 + disable-shared-experts-fusion: true + disaggregation-bootstrap-port: 30001 + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.95 + max-total-tokens: 131072 + max-prefill-tokens: 524288 + chunked-prefill-size: 131072 + + # Request handling + max-running-requests: 30000 + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + enable-dp-attention: false + fp4-gemm-backend: "flashinfer_cutlass" + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 4 + dp-size: 1 + ep-size: 1 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_cutedsl" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags 
+ stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 9600 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.83 + max-total-tokens: 524288 + chunked-prefill-size: 24576 + + # Request handling + max-running-requests: 16384 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + ep-num-redundant-experts: 32 + + cuda-graph-max-bs: 512 + num-reserved-decode-tokens: 112 + + # Additional decode optimizations + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + enable-dp-attention: true + fp4-gemm-backend: "flashinfer_cutlass" + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 48 + dp-size: 48 + ep-size: 48 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "48" + TOTAL_GPUS: "72" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/low-latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/low-latency.yaml new file mode 100644 index 000000000..57ea3ff5e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/low-latency.yaml @@ -0,0 +1,129 @@ +name: "gb300-1k1k-fp8-low-latency" + +model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + decode_nodes: 4 + prefill_workers: 1 + decode_workers: 4 + gpus_per_node: 4 + +slurm: + time_limit: "02:00:00" + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + # SGLANG_ENABLE_FLASHINFER_GEMM: "1" # deprecated in 0.5.7, --fp8-gemm-backend=flashinfer_trtllm + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + # SGLANG_ENABLE_FLASHINFER_GEMM: "1" # deprecated in 0.5.7, --fp8-gemm-backend=flashinfer_trtllm + 
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + + sglang_config: + prefill: + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "fp8" + moe-runner-backend: "flashinfer_trtllm" + fp8-gemm-backend: "flashinfer_trtllm" + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + context-length: 2200 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + mem-fraction-static: 0.95 + max-total-tokens: 8192 + chunked-prefill-size: 8192 + max-prefill-tokens: 8192 + cuda-graph-max-bs: 128 + max-running-requests: 128 + load-balance-method: "round_robin" + scheduler-recv-interval: 10 + enable-flashinfer-allreduce-fusion: false # to save mem + enable-symm-mem: false # to save mem + tensor-parallel-size: 4 + data-parallel-size: 1 + expert-parallel-size: 1 + + decode: + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "fp8" + moe-runner-backend: "flashinfer_trtllm" + fp8-gemm-backend: "flashinfer_trtllm" + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + context-length: 2200 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + mem-fraction-static: 0.85 + chunked-prefill-size: -1 # save mem + cuda-graph-max-bs: 128 + max-running-requests: 128 + scheduler-recv-interval: 1 # save mem + enable-flashinfer-allreduce-fusion: false # to save mem + enable-symm-mem: false # to save mem + 
prefill-round-robin-balance: true + tensor-parallel-size: 4 + data-parallel-size: 1 + expert-parallel-size: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "20" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/max.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/max.yaml new file mode 100644 index 000000000..d27830a5f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/max.yaml @@ -0,0 +1,178 @@ +# GB300 FP8 Max Throughput Configuration + +name: "gb300-1k1k-fp8-max" + +model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "gb300" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + 
DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 8 + ep-size: 8 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + max-running-requests: 30000 + context-length: 2200 + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.75 + max-total-tokens: 524288 + chunked-prefill-size: 131072 + + # Request handling + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "normal" + ep-dispatch-algorithm: "dynamic" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 8 + dp-size: 8 + 
ep-size: 8 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + max-running-requests: 45000 + context-length: 2200 + + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.82 + chunked-prefill-size: 36864 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + + # CUDA graphs + cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 1024] + cuda-graph-max-bs: 1024 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/mid.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/mid.yaml new file mode 100644 index 000000000..507f5607a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/mid.yaml @@ -0,0 +1,177 @@ +# GB300 FP8 Mid Throughput Configuration +name: "gb300-1k1k-fp8-mid" + +model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "gb300" + prefill_nodes: 4 + prefill_workers: 2 + decode_nodes: 8 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "768" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" 
+ SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 8 + ep-size: 8 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + max-running-requests: 30000 + context-length: 2200 + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.75 + max-total-tokens: 524288 + chunked-prefill-size: 131072 + + # Request handling + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "normal" + ep-dispatch-algorithm: "dynamic" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 32 + dp-size: 32 + ep-size: 32 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + 
max-running-requests: 45000 + context-length: 2200 + + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.82 + chunked-prefill-size: 36864 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + + # CUDA graphs + cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768] + cuda-graph-max-bs: 768 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "32" + TOTAL_GPUS: "48" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/low-latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/low-latency.yaml new file mode 100644 index 000000000..766ecc632 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/low-latency.yaml @@ -0,0 +1,128 @@ +name: "gb300-8k1k-fp8-low-latency" + +model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + decode_nodes: 1 + prefill_workers: 1 + decode_workers: 1 + gpus_per_node: 4 + +slurm: + time_limit: "02:00:00" + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + # SGLANG_ENABLE_FLASHINFER_GEMM: "1" # deprecated in 0.5.7, --fp8-gemm-backend=flashinfer_trtllm + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + # SGLANG_ENABLE_FLASHINFER_GEMM: "1" # deprecated in 0.5.7, --fp8-gemm-backend=flashinfer_trtllm 
+ SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + + sglang_config: + prefill: + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "fp8" + moe-runner-backend: "flashinfer_trtllm" + fp8-gemm-backend: "flashinfer_trtllm" + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + context-length: 9300 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + mem-fraction-static: 0.95 + max-total-tokens: 32768 + chunked-prefill-size: 32768 + max-prefill-tokens: 32768 + cuda-graph-max-bs: 128 + max-running-requests: 128 + load-balance-method: "round_robin" + scheduler-recv-interval: 10 + enable-flashinfer-allreduce-fusion: false # to save mem + enable-symm-mem: false # to save mem + tensor-parallel-size: 4 + data-parallel-size: 1 + expert-parallel-size: 1 + + decode: + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "fp8" + moe-runner-backend: "flashinfer_trtllm" + fp8-gemm-backend: "flashinfer_trtllm" + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + context-length: 9300 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + mem-fraction-static: 0.85 + chunked-prefill-size: -1 # save mem + cuda-graph-max-bs: 128 + max-running-requests: 128 + scheduler-recv-interval: 1 # save mem + enable-flashinfer-allreduce-fusion: false # to save mem + enable-symm-mem: false # to save mem + 
prefill-round-robin-balance: true + tensor-parallel-size: 4 + data-parallel-size: 1 + expert-parallel-size: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "8" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/max.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/max.yaml new file mode 100644 index 000000000..a7da42825 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/max.yaml @@ -0,0 +1,178 @@ +# GB300 FP8 Max Throughput Configuration + +name: "gb300-8k1k-fp8-max" + +model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "gb300" + prefill_nodes: 12 + prefill_workers: 6 + decode_nodes: 6 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + 
DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "768" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 8 + ep-size: 8 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + max-running-requests: 30000 + context-length: 9300 + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.75 + max-total-tokens: 524288 + chunked-prefill-size: 131072 + + # Request handling + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "normal" + ep-dispatch-algorithm: "dynamic" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 24 + dp-size: 24 + 
ep-size: 24 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + max-running-requests: 45000 + context-length: 9300 + + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.82 + chunked-prefill-size: 36864 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + + # CUDA graphs + cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768] + cuda-graph-max-bs: 768 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "24" + TOTAL_GPUS: "72" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/mid.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/mid.yaml new file mode 100644 index 000000000..6c367ebf3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/mid.yaml @@ -0,0 +1,178 @@ +# GB300 FP8 Mid Throughput Configuration + +name: "gb300-8k1k-fp8-mid" + +model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "gb300" + prefill_nodes: 10 + prefill_workers: 5 + decode_nodes: 8 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "768" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: 
"100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 8 + ep-size: 8 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + max-running-requests: 30000 + context-length: 9300 + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.75 + max-total-tokens: 524288 + chunked-prefill-size: 131072 + + # Request handling + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "normal" + ep-dispatch-algorithm: "dynamic" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 32 + dp-size: 32 + ep-size: 32 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + 
max-running-requests: 45000 + context-length: 9300 + + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.82 + chunked-prefill-size: 36864 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + + # CUDA graphs + cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768] + cuda-graph-max-bs: 768 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "32" + TOTAL_GPUS: "72" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/mtp/h100-fp8-1p1d-max-dep-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/mtp/h100-fp8-1p1d-max-dep-mtp.yaml new file mode 100644 index 000000000..76f03d343 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/mtp/h100-fp8-1p1d-max-dep-mtp.yaml @@ -0,0 +1,121 @@ +name: "h100-fp8-1p1d-max-dep-mtp" + +model: + path: "dsr1-fp8" + container: "lmsysorg/sglang:v0.5.8-cu130" + precision: "fp8" + +frontend: + nginx_container: nginx-sqsh + +resources: + gpu_type: "h100" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 1 + gpus_per_node: 8 + +backend: + + # Prefill-specific environment variables + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + # Decode-specific environment variables + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Prefill capacity + max-running-requests: 4 + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 2048 + chunked-prefill-size: 2048 + + # Request handling + load-balance-method: "round_robin" + + 
# MTP (Multi-Token Prediction) + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 16 + ep-size: 16 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 1 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.85 + max-running-requests: 64 + cuda-graph-max-bs: 64 + + # MTP + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/mtp/h100-fp8-1p2d-max-tp-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/mtp/h100-fp8-1p2d-max-tp-mtp.yaml new file mode 100644 index 000000000..3c6647c24 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/mtp/h100-fp8-1p2d-max-tp-mtp.yaml @@ -0,0 +1,123 @@ +name: "h100-fp8-1p2d-max-tp-mtp" + +model: + path: "dsr1-fp8" + container: "lmsysorg/sglang:v0.5.8-cu130" + precision: "fp8" + +frontend: + nginx_container: nginx-sqsh + +resources: + gpu_type: "h100" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 4 + decode_workers: 2 + gpus_per_node: 8 + +backend: + + # Prefill-specific environment variables + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_ENABLE_SPEC_V2: "1" + + # Decode-specific environment variables + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_ENABLE_SPEC_V2: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + max-running-requests: 2 + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 2048 + chunked-prefill-size: 2048 + + # Request 
handling + load-balance-method: "round_robin" + + # MTP (Multi-Token Prediction) + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 1 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.9 + max-running-requests: 128 + cuda-graph-max-bs: 128 + + # MTP + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/stp/h100-fp8-1p1d-max-dep.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/stp/h100-fp8-1p1d-max-dep.yaml new file mode 100644 index 000000000..dc186726c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/stp/h100-fp8-1p1d-max-dep.yaml @@ -0,0 +1,109 @@ +name: "h100-fp8-1p1d-max-dep" + +model: + path: "dsr1-fp8" + container: "lmsysorg/sglang:v0.5.8-cu130" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 1 + gpus_per_node: 8 + +frontend: + nginx_container: nginx-sqsh + +backend: + + # Prefill-specific environment variables + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + # Decode-specific environment variables + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Prefill capacity + max-running-requests: 4 + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 2048 + chunked-prefill-size: 2048 + + # Request handling + load-balance-method: "round_robin" + + decode: + # Model 
configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 16 + ep-size: 16 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 1 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.9 + max-running-requests: 64 + cuda-graph-max-bs: 64 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/stp/h100-fp8-1p2d-max-tp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/stp/h100-fp8-1p2d-max-tp.yaml new file mode 100644 index 000000000..1e4b20c13 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/stp/h100-fp8-1p2d-max-tp.yaml @@ -0,0 +1,109 @@ +name: "h100-fp8-1p2d-max-tp" + +model: + path: "dsr1-fp8" + container: "lmsysorg/sglang:v0.5.8-cu130" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 4 + decode_workers: 2 + gpus_per_node: 8 + +frontend: + nginx_container: nginx-sqsh + +backend: + + # Prefill-specific environment variables + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + # Decode-specific environment variables + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + sglang_config: + 
prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + max-running-requests: 2 + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 2048 + chunked-prefill-size: 2048 + + # Request handling + load-balance-method: "round_robin" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 1 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.9 + max-running-requests: 128 + cuda-graph-max-bs: 128 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/mtp/h100-fp8-1p1d-max-dep-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/mtp/h100-fp8-1p1d-max-dep-mtp.yaml new file mode 100644 index 000000000..17b87aba7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/mtp/h100-fp8-1p1d-max-dep-mtp.yaml @@ -0,0 +1,123 @@ +name: "h100-fp8-1p1d-max-dep-mtp" + +model: + path: "dsr1-fp8" + container: "lmsysorg/sglang:v0.5.8-cu130" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 1 + gpus_per_node: 8 + +frontend: + nginx_container: nginx-sqsh + +backend: + + # Prefill-specific environment variables + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_ENABLE_SPEC_V2: "1" + + # Decode-specific environment variables + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_ENABLE_SPEC_V2: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Prefill capacity + max-running-requests: 4 + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 2048 + chunked-prefill-size: 2048 + + # 
Request handling + load-balance-method: "round_robin" + + # MTP (Multi-Token Prediction) + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 16 + ep-size: 16 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 1 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.85 + max-running-requests: 64 + cuda-graph-max-bs: 64 + + # MTP + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/mtp/h100-fp8-1p1d-max-tp-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/mtp/h100-fp8-1p1d-max-tp-mtp.yaml new file mode 100644 index 000000000..4dbe673c6 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/mtp/h100-fp8-1p1d-max-tp-mtp.yaml @@ -0,0 +1,123 @@ +name: "h100-fp8-1p1d-max-tp-mtp" + +model: + path: "dsr1-fp8" + container: "lmsysorg/sglang:v0.5.8-cu130" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 1 + gpus_per_node: 8 + +frontend: + nginx_container: nginx-sqsh + +backend: + + # Prefill-specific environment variables + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_ENABLE_SPEC_V2: "1" + + # Decode-specific environment variables + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_ENABLE_SPEC_V2: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Prefill capacity + max-running-requests: 2 + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 2048 + chunked-prefill-size: 2048 + + # 
Request handling + load-balance-method: "round_robin" + + # MTP (Multi-Token Prediction) + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 1 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.9 + max-running-requests: 128 + cuda-graph-max-bs: 128 + + # MTP (Multi-Token Prediction) + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/stp/h100-fp8-1p1d-max-dep.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/stp/h100-fp8-1p1d-max-dep.yaml new file mode 100644 index 000000000..dc186726c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/stp/h100-fp8-1p1d-max-dep.yaml @@ -0,0 +1,109 @@ +name: "h100-fp8-1p1d-max-dep" + +model: + path: "dsr1-fp8" + container: "lmsysorg/sglang:v0.5.8-cu130" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 1 + gpus_per_node: 8 + +frontend: + nginx_container: nginx-sqsh + +backend: + + # Prefill-specific environment variables + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + # Decode-specific environment variables + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Prefill capacity + max-running-requests: 4 + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 2048 + chunked-prefill-size: 2048 + + # Request handling + load-balance-method: "round_robin" + + decode: + # Model 
configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 16 + ep-size: 16 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 1 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.9 + max-running-requests: 64 + cuda-graph-max-bs: 64 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/stp/h100-fp8-1p1d-max-tp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/stp/h100-fp8-1p1d-max-tp.yaml new file mode 100644 index 000000000..120b9270c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/stp/h100-fp8-1p1d-max-tp.yaml @@ -0,0 +1,109 @@ +name: "h100-fp8-1p1d-max-tp" + +model: + path: "dsr1-fp8" + container: "lmsysorg/sglang:v0.5.8-cu130" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 1 + gpus_per_node: 8 + +frontend: + nginx_container: nginx-sqsh + +backend: + + # Prefill-specific environment variables + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + # Decode-specific environment variables + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + sglang_config: + 
prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Prefill capacity + max-running-requests: 2 + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 2048 + chunked-prefill-size: 2048 + + # Request handling + load-balance-method: "round_robin" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 1 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.9 + max-running-requests: 128 + cuda-graph-max-bs: 128 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "16"
+    DECODE_GPUS: "16"
+    TOTAL_GPUS: "32"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/bs256-1p6d-dep-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/bs256-1p6d-dep-mtp.yaml
new file mode 100644
index 000000000..d9177b2e1
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/bs256-1p6d-dep-mtp.yaml
@@ -0,0 +1,128 @@
+name: "bs256-1p6d-h200-fp8-mtp"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 6
+  decode_workers: 6
+  gpus_per_node: 8
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  # Decode-specific environment variables
+  decode_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      # used to be 512
+      max-running-requests: 64
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      # used to be 0.75
+      mem-fraction-static: 0.82
+      max-prefill-tokens: 65536
+      # used to be 262144
+      chunked-prefill-size: 65536
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-running-requests: 128
+      cuda-graph-max-bs: 128
+
+      # MTP settings
+      speculative-algorithm: "EAGLE"
+      speculative-num-steps: 2
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "56"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/bs256-1p6d-tp-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/bs256-1p6d-tp-mtp.yaml
new file mode 100644
index 000000000..bbdea98a4
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/bs256-1p6d-tp-mtp.yaml
@@ -0,0 +1,124 @@
+name: "bs256-1p6d-h200-fp8-mtp"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 6
+  decode_workers: 6
+  gpus_per_node: 8
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  # Decode-specific environment variables
+  decode_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      max-running-requests: 512
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.7
+      max-prefill-tokens: 163840
+      chunked-prefill-size: 163840
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-running-requests: 128
+      cuda-graph-max-bs: 128
+
+      # MTP settings
+      speculative-algorithm: "EAGLE"
+      speculative-num-steps: 2
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "56"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/low-latency-1p9d-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/low-latency-1p9d-mtp.yaml
new file mode 100644
index 000000000..2569666c2
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/low-latency-1p9d-mtp.yaml
@@ -0,0 +1,123 @@
+name: "low-latency-1p9d-h200-fp8-mtp"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 9
+  decode_workers: 9
+  gpus_per_node: 8
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  # Decode-specific environment variables
+  decode_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      max-running-requests: 256
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-prefill-tokens: 163840
+      chunked-prefill-size: 163840
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-running-requests: 64
+      cuda-graph-max-bs: 64
+
+      # MTP settings
+      speculative-algorithm: "EAGLE"
+      speculative-num-steps: 2
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "80"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/bs256-1p6d-dep.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/bs256-1p6d-dep.yaml
new file mode 100644
index 000000000..0d098c736
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/bs256-1p6d-dep.yaml
@@ -0,0 +1,116 @@
+name: "bs256-1p6d-h200-fp8"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 6
+  decode_workers: 6
+  gpus_per_node: 8
+
+backend:
+
+  prefill_environment:
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  decode_environment:
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      watchdog-timeout: 1000000
+      max-running-requests: 512
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-prefill-tokens: 65536
+      chunked-prefill-size: 262144
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+      watchdog-timeout: 1000000
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-running-requests: 512
+      cuda-graph-max-bs: 512
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "56"
+
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/bs256-1p6d-tp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/bs256-1p6d-tp.yaml
new file mode 100644
index 000000000..af5aded2c
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/bs256-1p6d-tp.yaml
@@ -0,0 +1,115 @@
+name: "bs256-1p6d-h200-fp8"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 6
+  decode_workers: 6
+  gpus_per_node: 8
+
+backend:
+
+  prefill_environment:
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  decode_environment:
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      watchdog-timeout: 1000000
+      max-running-requests: 512
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.7
+      max-prefill-tokens: 163840
+      chunked-prefill-size: 163840
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+      watchdog-timeout: 1000000
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-running-requests: 512
+      cuda-graph-max-bs: 512
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "56"
+
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/low-latency-1p9d.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/low-latency-1p9d.yaml
new file mode 100644
index 000000000..9cfc153f2
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/low-latency-1p9d.yaml
@@ -0,0 +1,113 @@
+name: "low-latency-1p9d-h200-fp8"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 9
+  decode_workers: 9
+  gpus_per_node: 8
+
+backend:
+
+  prefill_environment:
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  decode_environment:
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      watchdog-timeout: 1000000
+      max-running-requests: 256
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-prefill-tokens: 163840
+      chunked-prefill-size: 163840
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+      watchdog-timeout: 1000000
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-running-requests: 256
+      cuda-graph-max-bs: 256
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "80"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs128-1p1d-dep-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs128-1p1d-dep-mtp.yaml
new file mode 100644
index 000000000..292289a7e
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs128-1p1d-dep-mtp.yaml
@@ -0,0 +1,125 @@
+name: "bs128-1p1d-dep-h200-fp8-mtp"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 1
+  decode_workers: 1
+  gpus_per_node: 8
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  # Decode-specific environment variables
+  decode_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      max-running-requests: 16
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-prefill-tokens: 163840
+      chunked-prefill-size: 163840
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.85
+      max-running-requests: 192
+      cuda-graph-max-bs: 192
+
+      # MTP settings
+      speculative-algorithm: "EAGLE"
+      speculative-num-steps: 2
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "16"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs16-1p3d-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs16-1p3d-mtp.yaml
new file mode 100644
index 000000000..76d9f6b1f
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs16-1p3d-mtp.yaml
@@ -0,0 +1,123 @@
+name: "bs16-1p3d-h200-fp8-mtp"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 3
+  decode_workers: 3
+  gpus_per_node: 8
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  # Decode-specific environment variables
+  decode_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      max-running-requests: 16
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-prefill-tokens: 32768
+      chunked-prefill-size: 32768
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-running-requests: 32
+      cuda-graph-max-bs: 32
+
+      # MTP settings
+      speculative-algorithm: "EAGLE"
+      speculative-num-steps: 2
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "32"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs4-1p7d-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs4-1p7d-mtp.yaml
new file mode 100644
index 000000000..01a278260
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs4-1p7d-mtp.yaml
@@ -0,0 +1,123 @@
+name: "bs4-1p7d-h200-fp8-mtp"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 7
+  decode_workers: 7
+  gpus_per_node: 8
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  # Decode-specific environment variables
+  decode_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      max-running-requests: 16
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-prefill-tokens: 32768
+      chunked-prefill-size: 32768
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-running-requests: 2
+      cuda-graph-max-bs: 2
+
+      # MTP settings
+      speculative-algorithm: "EAGLE"
+      speculative-num-steps: 2
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "64"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs64-2p3d-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs64-2p3d-mtp.yaml
new file mode 100644
index 000000000..e426c78ba
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs64-2p3d-mtp.yaml
@@ -0,0 +1,132 @@
+name: "bs64-2p3d-h200-fp8-mtp"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 2
+  prefill_workers: 2
+  decode_nodes: 3
+  decode_workers: 3
+  gpus_per_node: 8
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  # Decode-specific environment variables
+  decode_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      max-running-requests: 16
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-prefill-tokens: 32768
+      chunked-prefill-size: 32768
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      context-length: 72000
+      max-total-tokens: 128000
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-running-requests: 16
+      cuda-graph-max-bs: 16
+
+      # MTP settings
+      speculative-algorithm: "EAGLE"
+      speculative-num-steps: 2
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "40"
+
+# benchmark:
+#   type: "gpqa"
+#   num_examples: 198
+#   repeat: 4
+#   num_threads: 32
+#   max_tokens: 64000
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs8-1p6d-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs8-1p6d-mtp.yaml
new file mode 100644
index 000000000..2922ba1df
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs8-1p6d-mtp.yaml
@@ -0,0 +1,124 @@
+name: "bs8-1p6d-h200-fp8-mtp"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 6
+  decode_workers: 6
+  gpus_per_node: 8
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  # Decode-specific environment variables
+  decode_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      max-running-requests: 16
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-prefill-tokens: 32768
+      chunked-prefill-size: 32768
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-running-requests: 16
+      cuda-graph-max-bs: 16
+
+      # MTP settings
+      speculative-algorithm: "EAGLE"
+      speculative-num-steps: 2
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs128-1p1d-dep.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs128-1p1d-dep.yaml new file mode 100644 index 000000000..e86438436 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs128-1p1d-dep.yaml @@ -0,0 +1,116 @@ +name: "bs128-1p1d-dep-h200-fp8" + +model: + path: "dsr1" + container: "lmsysorg/sglang:v0.5.8.post1-cu130" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + +backend: + + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + # stream-interval: 50 + watchdog-timeout: 1000000 + max-running-requests: 16 + + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: 
nixl + + # Memory and token limits + mem-fraction-static: 0.75 + max-prefill-tokens: 163840 + chunked-prefill-size: 163840 + + # Request handling + load-balance-method: "round_robin" + + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 8 + ep-size: 8 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.88 + max-running-requests: 256 + cuda-graph-max-bs: 256 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs16-1p3d.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs16-1p3d.yaml new file mode 100644 index 000000000..75e36493b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs16-1p3d.yaml @@ -0,0 +1,114 @@ +name: "bs16-1p3d-h200-fp8" + +model: + path: "dsr1" + container: "lmsysorg/sglang:v0.5.8.post1-cu130" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 3 + decode_workers: 3 + gpus_per_node: 8 + +backend: + + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + 
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + # stream-interval: 50 + watchdog-timeout: 1000000 + max-running-requests: 16 + + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.82 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + + # Request handling + load-balance-method: "round_robin" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.82 + max-running-requests: 32 + cuda-graph-max-bs: 32 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs4-1p7d.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs4-1p7d.yaml new file mode 100644 index 000000000..56aa58d11 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs4-1p7d.yaml @@ -0,0 +1,114 @@ +name: "bs4-1p7d-h200-fp8" + +model: + path: "dsr1" + container: "lmsysorg/sglang:v0.5.8.post1-cu130" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 7 + decode_workers: 7 + gpus_per_node: 8 + +backend: + + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + # stream-interval: 50 + watchdog-timeout: 1000000 + max-running-requests: 16 + + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and 
token limits + mem-fraction-static: 0.82 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + + # Request handling + load-balance-method: "round_robin" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.82 + max-running-requests: 8 + cuda-graph-max-bs: 8 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "64" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs64-2p3d.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs64-2p3d.yaml new file mode 100644 index 000000000..7c876e3cf --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs64-2p3d.yaml @@ -0,0 +1,122 @@ +name: "bs64-2p3d-h200-fp8" + +model: + path: "dsr1" + container: "lmsysorg/sglang:v0.5.8.post1-cu130" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "h200" + prefill_nodes: 2 + prefill_workers: 2 + decode_nodes: 3 + decode_workers: 3 + gpus_per_node: 8 + +backend: + + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + 
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + # stream-interval: 50 + watchdog-timeout: 1000000 + max-running-requests: 16 + + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.82 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + + # Request handling + load-balance-method: "round_robin" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + #context-length: 72000 + # max-total-tokens: 128000 + # Memory and token limits + mem-fraction-static: 0.82 + max-running-requests: 128 + cuda-graph-max-bs: 128 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +# benchmark: +# type: "gpqa" +# num_examples: 198 +# repeat: 4 +# num_threads: 32 +# max_tokens: 64000 \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs8-1p6d.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs8-1p6d.yaml new file mode 100644 index 000000000..5eeba8f61 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs8-1p6d.yaml @@ -0,0 +1,115 @@ +name: "bs8-1p6d-h200-fp8" + +model: + path: "dsr1" + container: "lmsysorg/sglang:v0.5.8.post1-cu130" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 6 + decode_workers: 6 + gpus_per_node: 8 + +backend: + + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + # stream-interval: 50 + watchdog-timeout: 1000000 + max-running-requests: 16 + + + # Prefill-specific 
mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.82 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + + # Request handling + load-balance-method: "round_robin" + + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.82 + max-running-requests: 16 + cuda-graph-max-bs: 16 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch64_eplb0_mtp2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch64_eplb0_mtp2.yaml new file mode 100644 index 000000000..7e59b1617 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch64_eplb0_mtp2.yaml @@ -0,0 +1,128 @@ +name: "ctx1_gen2_dep8_batch64_eplb0_mtp2" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 192 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 58 + - 60 + - 62 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + +# InferenceX bench-serving wrapper, invoked via srt-slurm `benchmark.type: custom`. +# Most env (MODEL, ISL, OSL, CONC_LIST, DISAGG) is exported by +# benchmark-multinode-tmpl.yml and propagated through srtctl → srun → pyxis, +# so the recipe only carries per-recipe knobs that have no workflow source. +# See benchmarks/multi_node/srt_bench.sh for the full env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" # per prefill worker + DECODE_GPUS: "8" # per decode worker + TOTAL_GPUS: "20" # sum across all workers + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp3.yaml new file mode 100644 index 000000000..6b34b2fb7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp3.yaml @@ -0,0 +1,123 @@ +name: "ctx1_gen5_dep8_batch16_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + 
backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 13 + - 14 + - 15 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml new file mode 100644 index 000000000..4445c953b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml @@ -0,0 +1,118 @@ +name: "ctx1_gen5_tep8_batch1_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..b7d1c9260 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml @@ -0,0 +1,132 @@ +name: "ctx1_gen5_tep8_batch32_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 8 + - 9 + - 10 + - 16 + - 17 + - 18 + - 29 + - 30 + - 31 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx3_gen4_dep8_batch128_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx3_gen4_dep8_batch128_eplb0_mtp1.yaml new file mode 100644 index 000000000..d5def7a35 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx3_gen4_dep8_batch128_eplb0_mtp1.yaml @@ -0,0 +1,126 @@ +name: "ctx3_gen4_dep8_batch128_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 4 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false 
+ free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 128 + max_num_tokens: 256 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 122 + - 124 + - 126 + - 128 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx3_gen5_dep4_batch512_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx3_gen5_dep4_batch512_eplb0_mtp1.yaml new file mode 100644 index 000000000..dde552b51 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx3_gen5_dep4_batch512_eplb0_mtp1.yaml @@ -0,0 +1,132 @@ +name: "ctx3_gen5_dep4_batch512_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 3 + gpus_per_decode: 4 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 512 + max_num_tokens: 1024 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 192 + - 256 + - 384 + - 448 + - 506 + - 508 + - 510 + - 512 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "32" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml new file mode 100644 index 000000000..275c140a5 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml @@ -0,0 +1,123 @@ +name: "ctx1_gen1_dep8_batch512_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false 
+ free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 512 + max_num_tokens: 512 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + - 384 + - 448 + - 508 + - 510 + - 512 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen2_dep8_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen2_dep8_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..ae7ba8483 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen2_dep8_batch128_eplb0_mtp0.yaml @@ -0,0 +1,120 @@ +name: "ctx1_gen2_dep8_batch128_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 2 + + 
gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 122 + - 124 + - 126 + - 128 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "20" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_dep8_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_dep8_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..16961a5e0 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_dep8_batch32_eplb0_mtp0.yaml @@ -0,0 +1,118 @@ +name: "ctx1_gen5_dep8_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 26 + - 28 + - 30 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..ac84ded85 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,112 @@ +name: "ctx1_gen5_tep8_batch1_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + 
prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..930f2520f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch32_eplb0_mtp0.yaml @@ -0,0 +1,133 @@ +name: "ctx1_gen5_tep8_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 8 + - 9 + - 10 + - 11 + - 12 + - 13 + - 14 + - 15 + - 16 + - 18 + - 20 + - 22 + - 24 + - 26 + - 28 + - 30 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen6_tep8_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen6_tep8_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..d90c6f3b0 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen6_tep8_batch64_eplb0_mtp0.yaml @@ -0,0 +1,122 @@ +name: "ctx1_gen6_tep8_batch64_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 6 + decode_nodes: 6 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 56 + - 58 + - 60 + - 62 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "52" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen1_dep8_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen1_dep8_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..1017f8feb --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen1_dep8_batch8_eplb0_mtp3.yaml @@ -0,0 +1,122 @@ +name: "ctx1_gen1_dep8_batch8_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + 
decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 5 + - 6 + - 7 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. 
See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen3_tep8_batch16_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen3_tep8_batch16_eplb0_mtp3.yaml new file mode 100644 index 000000000..4c919e2e1 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen3_tep8_batch16_eplb0_mtp3.yaml @@ -0,0 +1,129 @@ +name: "ctx1_gen3_tep8_batch16_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: 
true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 9 + - 10 + - 11 + - 12 + - 13 + - 14 + - 15 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml new file mode 100644 index 000000000..dec75f377 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml @@ -0,0 +1,118 @@ +name: "ctx1_gen5_tep8_batch1_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen5_tep8_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen5_tep8_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..1c8582c31 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen5_tep8_batch8_eplb0_mtp3.yaml @@ -0,0 +1,125 @@ +name: "ctx1_gen5_tep8_batch8_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 6 + - 7 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3.yaml new file mode 100644 index 000000000..37ab36d1f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3.yaml @@ -0,0 +1,126 @@ +name: "ctx3_gen1_dep8_batch64_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 48 + - 56 + - 60 + - 62 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "20" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx5_gen1_dep8_batch192_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx5_gen1_dep8_batch192_eplb0_mtp1.yaml new file mode 100644 index 000000000..693c2221c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx5_gen1_dep8_batch192_eplb0_mtp1.yaml @@ -0,0 +1,130 @@ +name: "ctx5_gen1_dep8_batch192_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 3 + prefill_workers: 5 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false 
+ free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 192 + max_num_tokens: 384 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 130 + - 132 + - 134 + - 136 + - 138 + - 168 + - 192 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx5_gen2_dep8_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx5_gen2_dep8_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..ffbc9ae61 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx5_gen2_dep8_batch32_eplb0_mtp3.yaml @@ -0,0 +1,125 @@ +name: "ctx5_gen2_dep8_batch32_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 3 + prefill_workers: 5 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 20 + - 24 + - 28 + - 30 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..b2c967541 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,113 @@ +name: "ctx1_gen5_tep8_batch1_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx1_gen5_tep8_batch8_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx1_gen5_tep8_batch8_eplb0_mtp0.yaml new file mode 100644 index 000000000..0f88bb006 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx1_gen5_tep8_batch8_eplb0_mtp0.yaml @@ -0,0 +1,126 @@ +name: "ctx1_gen5_tep8_batch8_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + 
prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 6 + - 7 + - 8 + - 9 + - 10 + - 12 + - 13 + - 14 + - 15 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx2_gen5_tep8_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx2_gen5_tep8_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..738dd82ea --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx2_gen5_tep8_batch64_eplb0_mtp0.yaml @@ -0,0 +1,121 @@ +name: "ctx2_gen5_tep8_batch64_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 58 + - 60 + - 62 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "48" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx4_gen1_dep8_batch192_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx4_gen1_dep8_batch192_eplb0_mtp0.yaml new file mode 100644 index 000000000..22681d23a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx4_gen1_dep8_batch192_eplb0_mtp0.yaml @@ -0,0 +1,124 @@ +name: "ctx4_gen1_dep8_batch192_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 4 + gpus_per_prefill: 4 + + decode_workers: 1 + 
decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 192 + max_num_tokens: 192 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 152 + - 160 + - 168 + - 176 + - 184 + - 190 + - 192 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx4_gen3_dep8_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx4_gen3_dep8_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..6e233467a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx4_gen3_dep8_batch32_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "ctx4_gen3_dep8_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 4 + gpus_per_prefill: 4 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 28 + - 30 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx7_gen2_dep8_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx7_gen2_dep8_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..99f0ea58f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx7_gen2_dep8_batch128_eplb0_mtp0.yaml @@ -0,0 +1,120 @@ +name: "ctx7_gen2_dep8_batch128_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 4 + prefill_workers: 7 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + 
prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 116 + - 120 + - 124 + - 128 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch768_eplb0_mtp2_1600.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch768_eplb0_mtp2_1600.yaml new file mode 100644 index 000000000..0fbd25b82 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch768_eplb0_mtp2_1600.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen2_dep8_batch768_eplb0_mtp2_1600 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 2 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: 
+ dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 768 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 768 + max_num_tokens: 2304 + max_seq_len: 2176 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen3_dep8_batch384_eplb0_mtp3_1184.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen3_dep8_batch384_eplb0_mtp3_1184.yaml new file mode 100644 index 000000000..fe3ab4c6c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen3_dep8_batch384_eplb0_mtp3_1184.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen3_dep8_batch384_eplb0_mtp3_1184 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 3 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 384 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 384 + max_num_tokens: 1536 + max_seq_len: 2176 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen4_dep8_batch256_eplb0_mtp3_1024.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen4_dep8_batch256_eplb0_mtp3_1024.yaml new file mode 100644 index 000000000..ab8b4d1c6 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen4_dep8_batch256_eplb0_mtp3_1024.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen4_dep8_batch256_eplb0_mtp3_1024 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 4 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 256 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 256 + max_num_tokens: 1024 + max_seq_len: 2176 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen7_dep8_batch128_eplb0_mtp3_896.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen7_dep8_batch128_eplb0_mtp3_896.yaml new file mode 100644 index 000000000..a2665a5a4 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen7_dep8_batch128_eplb0_mtp3_896.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen7_dep8_batch128_eplb0_mtp3_896 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 7 + decode_nodes: 7 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 128 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 2176 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_8.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_8.yaml new file mode 100644 index 000000000..057fcbd77 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_8.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen8_tp8_batch1_eplb0_mtp3_8 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: null + 
disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 1 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 2176 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "72" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch32_eplb0_mtp3_256.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch32_eplb0_mtp3_256.yaml new file mode 100644 index 000000000..e42404618 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch32_eplb0_mtp3_256.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen8_tp8_batch32_eplb0_mtp3_256 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: 
null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 32 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 32 + max_num_tokens: 256 + max_seq_len: 2176 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "72" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch4_eplb0_mtp3_32.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch4_eplb0_mtp3_32.yaml new file mode 100644 index 000000000..042c00923 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch4_eplb0_mtp3_32.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen8_tp8_batch4_eplb0_mtp3_32 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: null + 
disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 4 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 4 + max_num_tokens: 256 + max_seq_len: 2176 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "72" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch8_eplb0_mtp3_64.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch8_eplb0_mtp3_64.yaml new file mode 100644 index 000000000..9ad27278a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch8_eplb0_mtp3_64.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen8_tp8_batch8_eplb0_mtp3_64 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: null + 
disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 8 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 8 + max_num_tokens: 256 + max_seq_len: 2176 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "72" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4096.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4096.yaml new file mode 100644 index 000000000..65aeecbfa --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4096.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen1_dep8_batch512_eplb0_mtp0_4096 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 512 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 512 + max_num_tokens: 4096 + max_seq_len: 2176 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 40 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_128.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_128.yaml new file mode 100644 index 000000000..6159a29ad --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_128.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen3_tp8_batch1024_eplb0_mtp0_128 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 3 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 1024 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 1024 + max_num_tokens: 4096 + max_seq_len: 2176 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_32.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_32.yaml new file mode 100644 index 000000000..58d800b6a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_32.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen3_tp8_batch1024_eplb0_mtp0_32 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 3 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 12 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 12 + max_num_tokens: 12 + max_seq_len: 2176 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_4.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_4.yaml new file mode 100644 index 000000000..0ed6396a0 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_4.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen3_tp8_batch1024_eplb0_mtp0_4 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 3 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + cuda_graph_config: 
null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 1 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 2176 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen5_dep8_batch48_eplb0_mtp0_1920.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen5_dep8_batch48_eplb0_mtp0_1920.yaml new file mode 100644 index 000000000..875279c47 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen5_dep8_batch48_eplb0_mtp0_1920.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen5_dep8_batch48_eplb0_mtp0_1920 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 48 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 48 + max_num_tokens: 4096 + max_seq_len: 2176 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "48" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx2_gen5_dep8_batch128_eplb0_mtp0_5152.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx2_gen5_dep8_batch128_eplb0_mtp0_5152.yaml new file mode 100644 index 000000000..c277966c4 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx2_gen5_dep8_batch128_eplb0_mtp0_5152.yaml @@ -0,0 +1,121 @@ +name: ctx2_gen5_dep8_batch128_eplb0_mtp0_5152 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 8 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 128 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 128 + max_num_tokens: 4096 + max_seq_len: 2176 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch32_eplb0_mtp3_8.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch32_eplb0_mtp3_8.yaml new file mode 100644 index 000000000..7f03ae1e3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch32_eplb0_mtp3_8.yaml @@ -0,0 +1,129 @@ +name: ctx1_gen2_tp8_batch32_eplb0_mtp3_8 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 2 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: null + 
disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + attention_dp_config: + enable_balance: true + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 4 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch16_eplb0_mtp3_64.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch16_eplb0_mtp3_64.yaml new file mode 100644 index 000000000..712a67416 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch16_eplb0_mtp3_64.yaml @@ -0,0 +1,129 @@ +name: ctx1_gen4_tp8_batch16_eplb0_mtp3_64 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 4 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: 
null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + attention_dp_config: + enable_balance: true + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 16 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_48.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_48.yaml new file mode 100644 index 000000000..4212abd06 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_48.yaml @@ -0,0 +1,129 @@ +name: ctx1_gen6_tp8_batch8_eplb0_mtp3_48 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 6 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: null + 
disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + attention_dp_config: + enable_balance: true + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 8 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_8.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_8.yaml new file mode 100644 index 000000000..f3e356085 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_8.yaml @@ -0,0 +1,129 @@ +name: ctx1_gen6_tp8_batch8_eplb0_mtp3_8 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 6 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: null + 
disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + attention_dp_config: + enable_balance: true + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 1 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx2_gen1_dep8_batch32_eplb0_mtp3_288.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx2_gen1_dep8_batch32_eplb0_mtp3_288.yaml new file mode 100644 index 000000000..cda4cecfd --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx2_gen1_dep8_batch32_eplb0_mtp3_288.yaml @@ -0,0 +1,131 @@ +name: ctx2_gen1_dep8_batch32_eplb0_mtp3_288 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + attention_dp_config: + batching_wait_iters: 0 + enable_balance: true + timeout_iters: 60 + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 32 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 32 + max_num_tokens: 1024 + max_seq_len: 9344 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx2_gen3_dep8_batch8_eplb0_mtp3_224.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx2_gen3_dep8_batch8_eplb0_mtp3_224.yaml new file mode 100644 index 000000000..1cdb3af76 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx2_gen3_dep8_batch8_eplb0_mtp3_224.yaml @@ -0,0 +1,131 @@ +name: ctx2_gen3_dep8_batch8_eplb0_mtp3_224 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 3 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: 
null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + attention_dp_config: + batching_wait_iters: 0 + enable_balance: true + timeout_iters: 60 + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 8 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 8 + max_num_tokens: 256 + max_seq_len: 9344 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx4_gen1_dep8_batch128_eplb0_mtp2_1088.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx4_gen1_dep8_batch128_eplb0_mtp2_1088.yaml new file mode 100644 index 000000000..359073927 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx4_gen1_dep8_batch128_eplb0_mtp2_1088.yaml @@ -0,0 +1,131 @@ +name: ctx4_gen1_dep8_batch128_eplb0_mtp2_1088 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 4 + prefill_workers: 4 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + attention_dp_config: + batching_wait_iters: 0 + enable_balance: true + timeout_iters: 60 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 128 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 128 + max_num_tokens: 3072 + max_seq_len: 9344 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_dep8_batch128_eplb0_mtp0_128.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_dep8_batch128_eplb0_mtp0_128.yaml new file mode 100644 index 000000000..7a9a20391 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_dep8_batch128_eplb0_mtp0_128.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen1_dep8_batch128_eplb0_mtp0_128 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + 
enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 128 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_256.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_256.yaml new file mode 100644 index 000000000..3f93f9140 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_256.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen1_dep8_batch256_eplb0_mtp0_256 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + 
enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 256 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_tp8_batch1_eplb0_mtp0_1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_tp8_batch1_eplb0_mtp0_1.yaml new file mode 100644 index 000000000..ca1c1d60f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_tp8_batch1_eplb0_mtp0_1.yaml @@ -0,0 +1,123 @@ +name: ctx1_gen1_tp8_batch1_eplb0_mtp0_1 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + 
enable_padding: true + max_batch_size: 64 + disable_overlap_scheduler: true + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 64 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 1 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0_128.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0_128.yaml new file mode 100644 index 000000000..6b03210e3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0_128.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen2_dep8_batch64_eplb0_mtp0_128 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 2 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: 
false + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 64 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 64 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_128.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_128.yaml new file mode 100644 index 000000000..38ed548da --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_128.yaml @@ -0,0 +1,122 @@ +name: ctx1_gen4_tp8_batch32_eplb0_mtp0_128 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 4 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: 
+ enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 32 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 32 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_32.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_32.yaml new file mode 100644 index 000000000..f086c23c0 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_32.yaml @@ -0,0 +1,122 @@ +name: ctx1_gen4_tp8_batch32_eplb0_mtp0_32 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 4 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + 
enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 8 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 8 + max_num_tokens: 8 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen6_tp8_batch16_eplb0_mtp0_96.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen6_tp8_batch16_eplb0_mtp0_96.yaml new file mode 100644 index 000000000..39f1bffd8 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen6_tp8_batch16_eplb0_mtp0_96.yaml @@ -0,0 +1,122 @@ +name: ctx1_gen6_tp8_batch16_eplb0_mtp0_96 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 6 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + 
enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 16 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 16 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx2_gen1_dep8_batch640_eplb0_mtp0_640.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx2_gen1_dep8_batch640_eplb0_mtp0_640.yaml new file mode 100644 index 000000000..2b787d7f4 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx2_gen1_dep8_batch640_eplb0_mtp0_640.yaml @@ -0,0 +1,121 @@ +name: ctx2_gen1_dep8_batch640_eplb0_mtp0_640 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + 
enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 640 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 640 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep8_batch64_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep8_batch64_eplb0_mtp3.yaml new file mode 100644 index 000000000..554db4ec4 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep8_batch64_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "ctx1_gen1_dep8_batch64_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + 
UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 256 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "10" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch16_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch16_eplb0_mtp3.yaml new file mode 100644 index 000000000..497739ac7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch16_eplb0_mtp3.yaml @@ -0,0 +1,131 @@ +name: "ctx1_gen2_dep8_batch16_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "18" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml new file mode 100644 index 000000000..0fbaeb745 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml @@ -0,0 +1,129 @@ +name: "ctx1_gen5_tep8_batch1_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "42" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..2d9df253b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml @@ -0,0 +1,145 @@ +name: "ctx1_gen5_tep8_batch32_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 8 + - 10 + - 11 + - 12 + - 16 + - 18 + - 20 + - 22 + - 23 + - 24 + - 28 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "42" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx2_gen1_dep8_batch256_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx2_gen1_dep8_batch256_eplb0_mtp1.yaml new file mode 100644 index 000000000..c356b1b19 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx2_gen1_dep8_batch256_eplb0_mtp1.yaml @@ -0,0 +1,135 @@ +name: "ctx2_gen1_dep8_batch256_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 2 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx5_gen2_dep8_batch512_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx5_gen2_dep8_batch512_eplb0_mtp1.yaml new file mode 100644 index 000000000..5735ea337 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx5_gen2_dep8_batch512_eplb0_mtp1.yaml @@ -0,0 +1,136 @@ +name: "ctx5_gen2_dep8_batch512_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 2 + prefill_workers: 5 + gpus_per_prefill: 2 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 512 + max_num_tokens: 1024 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + - 512 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "26" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx5_gen2_dep8_batch768_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx5_gen2_dep8_batch768_eplb0_mtp1.yaml new file mode 100644 index 000000000..1eed2b318 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx5_gen2_dep8_batch768_eplb0_mtp1.yaml @@ -0,0 +1,137 @@ +name: "ctx5_gen2_dep8_batch768_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 2 + prefill_workers: 5 + gpus_per_prefill: 2 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 768 + max_num_tokens: 1536 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + - 512 + - 768 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "26" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..7d11fb152 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0.yaml @@ -0,0 +1,127 @@ +name: "ctx1_gen2_dep8_batch64_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "18" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..458ce824d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,123 @@ +name: "ctx1_gen4_tep8_batch1_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 4 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml new file mode 100644 index 000000000..3e493c98e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml @@ -0,0 +1,127 @@ +name: "ctx1_gen5_tep4_batch4_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 5 + decode_nodes: 3 + gpus_per_decode: 4 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: 
"put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 4 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "4" + TOTAL_GPUS: "22" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..adb4a8b79 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch64_eplb0_mtp0.yaml @@ -0,0 +1,142 @@ +name: "ctx1_gen5_tep8_batch64_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 8 + - 10 + - 11 + - 12 + - 16 + - 18 + - 20 + - 22 + - 27 + - 32 + - 35 + - 39 + - 48 + - 56 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "42" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx2_gen1_dep8_batch512_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx2_gen1_dep8_batch512_eplb0_mtp0.yaml new file mode 100644 index 000000000..8bd76075a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx2_gen1_dep8_batch512_eplb0_mtp0.yaml @@ -0,0 +1,130 @@ +name: "ctx2_gen1_dep8_batch512_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 2 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 512 + max_num_tokens: 512 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + - 512 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx3_gen1_dep8_batch1024_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx3_gen1_dep8_batch1024_eplb0_mtp0.yaml new file mode 100644 index 000000000..76d4cd780 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx3_gen1_dep8_batch1024_eplb0_mtp0.yaml @@ -0,0 +1,135 @@ +name: "ctx3_gen1_dep8_batch1024_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 3 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1024 + max_num_tokens: 1024 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + - 512 + - 768 + - 832 + - 896 + - 960 + - 1024 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "14" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx3_gen2_dep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx3_gen2_dep8_batch256_eplb0_mtp0.yaml new file mode 100644 index 000000000..3c0692530 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx3_gen2_dep8_batch256_eplb0_mtp0.yaml @@ -0,0 +1,129 @@ +name: "ctx3_gen2_dep8_batch256_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 3 + gpus_per_prefill: 2 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "22" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml new file mode 100644 index 000000000..5f522818a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml @@ -0,0 +1,135 @@ +name: "ctx10_gen1_dep8_batch256_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 3 + prefill_workers: 10 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep4_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep4_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..41f443c22 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep4_batch8_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "ctx1_gen4_tep4_batch8_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + gpus_per_node: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "4" + TOTAL_GPUS: "18" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml new file mode 100644 index 000000000..ff3bca726 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml @@ -0,0 +1,129 @@ +name: "ctx1_gen4_tep8_batch1_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 4 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml new file mode 100644 index 000000000..87c3c57b6 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml @@ -0,0 +1,132 @@ +name: "ctx1_gen4_tep8_batch4_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 4 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx3_gen1_dep8_batch16_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx3_gen1_dep8_batch16_eplb0_mtp3.yaml new file mode 100644 index 000000000..3f40345ca --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx3_gen1_dep8_batch16_eplb0_mtp3.yaml @@ -0,0 +1,131 @@ +name: "ctx3_gen1_dep8_batch16_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 3 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "14" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx9_gen1_dep8_batch128_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx9_gen1_dep8_batch128_eplb0_mtp1.yaml new file mode 100644 index 000000000..a52be413d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx9_gen1_dep8_batch128_eplb0_mtp1.yaml @@ -0,0 +1,134 @@ +name: "ctx9_gen1_dep8_batch128_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 3 + prefill_workers: 9 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 128 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "26" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep4_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep4_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..f515e9aba --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep4_batch32_eplb0_mtp0.yaml @@ -0,0 +1,129 @@ +name: "ctx1_gen3_tep4_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 3 + decode_nodes: 2 + gpus_per_decode: 4 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: 
"put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "4" + TOTAL_GPUS: "14" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..7a167eb80 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml @@ -0,0 +1,127 @@ +name: "ctx1_gen3_tep8_batch16_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "26" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..36a6268eb --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,135 @@ +name: "ctx1_gen3_tep8_batch1_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl:
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + - 512 + - 768 + - 1024 + - 2048 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "26" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen4_tep4_batch2_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen4_tep4_batch2_eplb0_mtp0.yaml new file mode 100644 index 000000000..d184a95d5 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen4_tep4_batch2_eplb0_mtp0.yaml @@ -0,0 +1,124 @@ +name: "ctx1_gen4_tep4_batch2_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 2 + max_num_tokens: 2 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "4" + TOTAL_GPUS: "18" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx5_gen2_dep8_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx5_gen2_dep8_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..bacd57645 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx5_gen2_dep8_batch32_eplb0_mtp0.yaml @@ -0,0 +1,126 @@ +name: "ctx5_gen2_dep8_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 2 + prefill_workers: 5 + gpus_per_prefill: 2 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "26" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx6_gen1_dep8_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx6_gen1_dep8_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..923b32c05 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx6_gen1_dep8_batch128_eplb0_mtp0.yaml @@ -0,0 +1,133 @@ +name: "ctx6_gen1_dep8_batch128_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 2 + prefill_workers: 6 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" +
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + - 512 + - 768 + - 1024 + - 2048 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "20" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx8_gen1_dep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx8_gen1_dep8_batch256_eplb0_mtp0.yaml new file mode 100644 index 000000000..1173417cc --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx8_gen1_dep8_batch256_eplb0_mtp0.yaml @@ -0,0 +1,133 @@ +name: "ctx8_gen1_dep8_batch256_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 2 + prefill_workers: 8 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" +
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + - 512 + - 768 + - 1024 + - 2048 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml new file mode 100644 index 000000000..9e1da3cf3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml @@ -0,0 +1,139 @@ +name: ctx1_gen1_dp8_batch256_eplb0_mtp1_3072 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME:
"put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 256 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 256 + max_num_tokens: 2100 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch128_eplb0_mtp1_2560.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch128_eplb0_mtp1_2560.yaml new file mode 100644 index 000000000..d1ccc8b44 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch128_eplb0_mtp1_2560.yaml @@ -0,0 +1,139 @@ +name: ctx1_gen2_dep8_batch128_eplb0_mtp1_2560 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" +
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 128 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 128 + max_num_tokens: 1100 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "20" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp2_720.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp2_720.yaml new file mode 100644 index 000000000..74802bbc7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp2_720.yaml @@ -0,0 +1,139 @@ +name: ctx1_gen5_dep8_batch16_eplb0_mtp2_720 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" +
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 16 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 16 + max_num_tokens: 180 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch16_eplb0_mtp3_160.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch16_eplb0_mtp3_160.yaml new file mode 100644 index 000000000..4a09efd68 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch16_eplb0_mtp3_160.yaml @@ -0,0 +1,140 @@ +name: ctx1_gen8_tp8_batch16_eplb0_mtp3_160 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" +
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 16 + disable_overlap_scheduler: false + enable_attention_dp: false + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 16 + max_num_tokens: 384 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "68" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_10.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_10.yaml new file mode 100644 index 000000000..a6cbb9b66 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_10.yaml @@ -0,0 +1,140 @@ +name: ctx1_gen8_tp8_batch1_eplb0_mtp3_10 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" +
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 1 + disable_overlap_scheduler: false + enable_attention_dp: false + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "68" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx3_gen2_dp8_batch512_eplb0_mtp1_11264.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx3_gen2_dp8_batch512_eplb0_mtp1_11264.yaml new file mode 100644 index 000000000..7ccdfa4af --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx3_gen2_dp8_batch512_eplb0_mtp1_11264.yaml @@ -0,0 +1,139 @@ +name: ctx3_gen2_dp8_batch512_eplb0_mtp1_11264 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" +
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 512 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 512 + max_num_tokens: 4200 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_2112.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_2112.yaml new file mode 100644 index 000000000..fa0675ade --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_2112.yaml @@ -0,0 +1,133 @@ +name: ctx1_gen1_dep8_batch256_eplb0_mtp0_2112 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 256 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 256 + max_num_tokens: 2048 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen2_dp8_batch128_eplb0_mtp0_3072.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen2_dp8_batch128_eplb0_mtp0_3072.yaml new file mode 100644 index 000000000..121844730 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen2_dp8_batch128_eplb0_mtp0_3072.yaml @@ -0,0 +1,133 @@ +name: ctx1_gen2_dp8_batch128_eplb0_mtp0_3072 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 128 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 128 + max_num_tokens: 1024 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "20" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen3_dp8_batch48_eplb0_mtp0_1280.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen3_dp8_batch48_eplb0_mtp0_1280.yaml new file mode 100644 index 000000000..7a7b2e1fe --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen3_dp8_batch48_eplb0_mtp0_1280.yaml @@ -0,0 +1,133 @@ +name: ctx1_gen3_dp8_batch48_eplb0_mtp0_1280 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 3 + decode_nodes: 3 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 48 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 48 + max_num_tokens: 384 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_12.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_12.yaml new file mode 100644 index 000000000..0e75f3747 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_12.yaml @@ -0,0 +1,134 @@ +name: ctx1_gen8_tp8_batch64_eplb0_mtp0_12 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 1 + disable_overlap_scheduler: false + enable_attention_dp: false + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "68" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_128.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_128.yaml new file mode 100644 index 000000000..384ef6e0c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_128.yaml @@ -0,0 +1,134 @@ +name: ctx1_gen8_tp8_batch64_eplb0_mtp0_128 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 64 + disable_overlap_scheduler: false + enable_attention_dp: false + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "68" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_384.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_384.yaml new file mode 100644 index 000000000..5fb7781d4 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_384.yaml @@ -0,0 +1,134 @@ +name: ctx1_gen8_tp8_batch64_eplb0_mtp0_384 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 64 + disable_overlap_scheduler: false + enable_attention_dp: false + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "68"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx2_gen1_dp8_batch1024_eplb0_mtp0_16384.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx2_gen1_dp8_batch1024_eplb0_mtp0_16384.yaml
new file mode 100644
index 000000000..364b538d6
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx2_gen1_dp8_batch1024_eplb0_mtp0_16384.yaml
@@ -0,0 +1,133 @@
+name: ctx2_gen1_dp8_batch1024_eplb0_mtp0_16384
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 2
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 1
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 1280
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 1280
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 1280
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 1024
+      disable_overlap_scheduler: false
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 1024
+      max_num_tokens: 8192
+      max_seq_len: 2400
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "16"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen1_dp8_batch8_eplb0_mtp3_72.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen1_dp8_batch8_eplb0_mtp3_72.yaml
new file mode 100644
index 000000000..1039c9e2c
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen1_dp8_batch8_eplb0_mtp3_72.yaml
@@ -0,0 +1,139 @@
+name: ctx1_gen1_dp8_batch8_eplb0_mtp3_72
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 1
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 8
+      disable_overlap_scheduler: false
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 8
+      max_num_tokens: 90
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "12"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch16_eplb0_mtp3_40.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch16_eplb0_mtp3_40.yaml
new file mode 100644
index 000000000..89a1abdd3
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch16_eplb0_mtp3_40.yaml
@@ -0,0 +1,140 @@
+name: ctx1_gen2_tp8_batch16_eplb0_mtp3_40
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 2
+  decode_nodes: 2
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 16
+      disable_overlap_scheduler: false
+      enable_attention_dp: false
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 16
+      max_num_tokens: 80
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "20"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch1_eplb0_mtp3_8.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch1_eplb0_mtp3_8.yaml
new file mode 100644
index 000000000..87ad50002
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch1_eplb0_mtp3_8.yaml
@@ -0,0 +1,140 @@
+name: ctx1_gen4_tp8_batch1_eplb0_mtp3_8
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 4
+  decode_nodes: 4
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 1
+      disable_overlap_scheduler: false
+      enable_attention_dp: false
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 1
+      max_num_tokens: 4
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "36"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch4_eplb0_mtp3_20.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch4_eplb0_mtp3_20.yaml
new file mode 100644
index 000000000..4edbcf88d
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch4_eplb0_mtp3_20.yaml
@@ -0,0 +1,140 @@
+name: ctx1_gen4_tp8_batch4_eplb0_mtp3_20
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 4
+  decode_nodes: 4
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 4
+      disable_overlap_scheduler: false
+      enable_attention_dp: false
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 4
+      max_num_tokens: 20
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "36"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx2_gen1_dp8_batch16_eplb0_mtp3_144.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx2_gen1_dp8_batch16_eplb0_mtp3_144.yaml
new file mode 100644
index 000000000..7eba0cdd6
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx2_gen1_dp8_batch16_eplb0_mtp3_144.yaml
@@ -0,0 +1,139 @@
+name: ctx2_gen1_dp8_batch16_eplb0_mtp3_144
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 2
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 1
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 16
+      disable_overlap_scheduler: false
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 16
+      max_num_tokens: 180
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "16"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx4_gen1_dp8_batch64_eplb0_mtp2_512.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx4_gen1_dp8_batch64_eplb0_mtp2_512.yaml
new file mode 100644
index 000000000..555ec7688
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx4_gen1_dp8_batch64_eplb0_mtp2_512.yaml
@@ -0,0 +1,139 @@
+name: ctx4_gen1_dp8_batch64_eplb0_mtp2_512
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 2
+  prefill_workers: 4
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 1
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 64
+      disable_overlap_scheduler: false
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 64
+      max_num_tokens: 650
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "24"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch16_eplb0_mtp0_64.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch16_eplb0_mtp0_64.yaml
new file mode 100644
index 000000000..8c9160c66
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch16_eplb0_mtp0_64.yaml
@@ -0,0 +1,134 @@
+name: ctx1_gen4_tp8_batch16_eplb0_mtp0_64
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 4
+  decode_nodes: 4
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 16
+      disable_overlap_scheduler: false
+      enable_attention_dp: false
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 16
+      max_num_tokens: 512
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "36"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx1_gen8_tp8_batch2_eplb0_mtp0_16.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx1_gen8_tp8_batch2_eplb0_mtp0_16.yaml
new file mode 100644
index 000000000..54de6c71f
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx1_gen8_tp8_batch2_eplb0_mtp0_16.yaml
@@ -0,0 +1,134 @@
+name: ctx1_gen8_tp8_batch2_eplb0_mtp0_16
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 8
+  decode_nodes: 8
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 1
+      disable_overlap_scheduler: false
+      enable_attention_dp: false
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 1
+      max_num_tokens: 1
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "68"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx2_gen1_dp8_batch32_eplb0_mtp0_256.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx2_gen1_dp8_batch32_eplb0_mtp0_256.yaml
new file mode 100644
index 000000000..4e7808183
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx2_gen1_dp8_batch32_eplb0_mtp0_256.yaml
@@ -0,0 +1,133 @@
+name: ctx2_gen1_dp8_batch32_eplb0_mtp0_256
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 2
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 1
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 32
+      disable_overlap_scheduler: false
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 32
+      max_num_tokens: 512
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx3_gen1_dp8_batch64_eplb0_mtp0_512.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx3_gen1_dp8_batch64_eplb0_mtp0_512.yaml new file mode 100644 index 000000000..6d6573b24 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx3_gen1_dp8_batch64_eplb0_mtp0_512.yaml @@ -0,0 +1,133 @@ +name: ctx3_gen1_dp8_batch64_eplb0_mtp0_512 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: 
"n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 64 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 64 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "20" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx3_gen5_tp8_batch64_eplb0_mtp0_256.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx3_gen5_tp8_batch64_eplb0_mtp0_256.yaml new file mode 100644 index 000000000..dd915b01d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx3_gen5_tp8_batch64_eplb0_mtp0_256.yaml @@ -0,0 +1,134 @@ +name: ctx3_gen5_tp8_batch64_eplb0_mtp0_256 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 64 + disable_overlap_scheduler: false + enable_attention_dp: false + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 64 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "52" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx5_gen1_dp8_batch128_eplb0_mtp0_1075.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx5_gen1_dp8_batch128_eplb0_mtp0_1075.yaml new file mode 100644 index 000000000..1e0375787 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx5_gen1_dp8_batch128_eplb0_mtp0_1075.yaml @@ -0,0 +1,133 @@ +name: ctx5_gen1_dp8_batch128_eplb0_mtp0_1075 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 3 + prefill_workers: 5 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 128 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx7_gen1_dep8_batch384_eplb0_mtp0_3072.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx7_gen1_dep8_batch384_eplb0_mtp0_3072.yaml new file mode 100644 index 000000000..eb6170f6a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx7_gen1_dep8_batch384_eplb0_mtp0_3072.yaml @@ -0,0 +1,133 @@ +name: ctx7_gen1_dep8_batch384_eplb0_mtp0_3072 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 4 + prefill_workers: 7 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 384 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 384 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3.yaml new file mode 100644 index 000000000..f6cb09bbc --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3.yaml @@ -0,0 +1,123 @@ +name: "ctx1_gen1_dep32_batch4_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 
16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..aa711f76c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml @@ -0,0 +1,127 @@ +name: "ctx1_gen4_tep8_batch8_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: 
MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 6 + - 7 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx2_gen1_dep16_batch256_eplb256_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx2_gen1_dep16_batch256_eplb256_mtp1.yaml new file mode 100644 index 000000000..50a8aa6c4 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx2_gen1_dep16_batch256_eplb256_mtp1.yaml @@ -0,0 +1,158 @@ +name: "ctx2_gen1_dep16_batch256_eplb256_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 2 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + num_slots: 256 + layer_updates_per_iter: 1 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx3_gen1_dep32_batch64_eplb288_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx3_gen1_dep32_batch64_eplb288_mtp1.yaml new file mode 100644 index 000000000..53fae254f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx3_gen1_dep32_batch64_eplb288_mtp1.yaml @@ -0,0 +1,134 @@ +name: "ctx3_gen1_dep32_batch64_eplb288_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 3 + prefill_workers: 3 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: 
+ decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 128 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + num_slots: 288 + layer_updates_per_iter: 1 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "44" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx3_gen5_dep4_batch768_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx3_gen5_dep4_batch768_eplb0_mtp1.yaml new file mode 100644 index 000000000..507a15f85 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx3_gen5_dep4_batch768_eplb0_mtp1.yaml @@ -0,0 +1,219 @@ +name: "ctx3_gen5_dep4_batch768_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 3 + prefill_workers: 3 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + 
decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 768 + max_num_tokens: 1536 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + - 520 + - 528 + - 536 + - 544 + - 552 + - 560 + - 568 + - 576 + - 584 + - 592 + - 600 + - 608 + - 616 + - 624 + - 632 + - 640 + - 648 + - 656 + - 664 + - 672 + - 680 + - 688 + - 696 + - 704 + - 712 + - 720 + - 728 + - 736 + - 744 + - 752 + - 760 + - 768 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen1_dep32_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen1_dep32_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..24294befe --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen1_dep32_batch16_eplb0_mtp0.yaml @@ -0,0 +1,119 @@ +name: "ctx1_gen1_dep32_batch16_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + 
tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml new file mode 100644 index 000000000..67fd9d9a4 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml @@ -0,0 +1,181 @@ +name: "ctx1_gen1_dep8_batch512_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 1 + decode_nodes: 2 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + 
TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 512 + max_num_tokens: 512 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + 
stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen2_dep4_batch768_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen2_dep4_batch768_eplb0_mtp0.yaml new file mode 100644 index 000000000..57be7c35e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen2_dep4_batch768_eplb0_mtp0.yaml @@ -0,0 +1,213 @@ +name: "ctx1_gen2_dep4_batch768_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - 
cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 768 + max_num_tokens: 768 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + - 520 + - 528 + - 536 + - 544 + - 552 + - 560 + - 568 + - 576 + - 584 + - 592 + - 600 + - 608 + - 616 + - 624 + - 632 + - 640 + - 648 + - 656 + - 664 + - 672 + - 680 + - 688 + - 696 + - 704 + - 712 + - 720 + - 728 + - 736 + - 744 + - 752 + - 760 + - 768 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "12" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..e8794eae8 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,116 @@ +name: "ctx1_gen4_tep8_batch1_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 8 + 
moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..e9d59aaab --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml @@ -0,0 +1,131 @@ +name: "ctx1_gen4_tep8_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + 
TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 6 + - 8 + - 9 + - 10 + - 11 + - 12 + - 16 + - 22 + - 23 + - 24 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2_gen1_dep16_batch256_eplb256_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2_gen1_dep16_batch256_eplb256_mtp0.yaml new file mode 100644 index 000000000..c752a5600 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2_gen1_dep16_batch256_eplb256_mtp0.yaml @@ -0,0 +1,152 @@ +name: "ctx2_gen1_dep16_batch256_eplb256_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 2 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + 
tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + num_slots: 256 + layer_updates_per_iter: 1 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2_gen1_dep32_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2_gen1_dep32_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..118580aa9 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2_gen1_dep32_batch64_eplb0_mtp0.yaml @@ -0,0 +1,125 @@ +name: "ctx2_gen1_dep32_batch64_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 2 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + 
tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx11_gen1_dep16_batch256_eplb256_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx11_gen1_dep16_batch256_eplb256_mtp1.yaml new file mode 100644 index 000000000..0ccf95443 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx11_gen1_dep16_batch256_eplb256_mtp1.yaml @@ -0,0 +1,158 @@ +name: "ctx11_gen1_dep16_batch256_eplb256_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 11 + prefill_workers: 11 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: 
+ type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + num_slots: 256 + layer_updates_per_iter: 1 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + 
decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "60" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..2854854f2 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml @@ -0,0 +1,129 @@ +name: "ctx1_gen4_tep8_batch8_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - 
cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 6 + - 7 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx3_gen1_dep32_batch4_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx3_gen1_dep32_batch4_eplb0_mtp3.yaml new file mode 100644 index 000000000..bddcf060e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx3_gen1_dep32_batch4_eplb0_mtp3.yaml @@ -0,0 +1,123 @@ +name: "ctx3_gen1_dep32_batch4_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 3 + prefill_workers: 3 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + 
decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "44" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx7_gen1_dep16_batch64_eplb256_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx7_gen1_dep16_batch64_eplb256_mtp1.yaml new file mode 100644 index 000000000..eb101a191 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx7_gen1_dep16_batch64_eplb256_mtp1.yaml @@ -0,0 +1,134 @@ +name: "ctx7_gen1_dep16_batch64_eplb256_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 7 + prefill_workers: 7 + 
+ decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + num_slots: 256 + layer_updates_per_iter: 1 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# 
/infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "44" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx8_gen1_dep32_batch16_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx8_gen1_dep32_batch16_eplb0_mtp3.yaml new file mode 100644 index 000000000..3bf47d0a8 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx8_gen1_dep32_batch16_eplb0_mtp3.yaml @@ -0,0 +1,125 @@ +name: "ctx8_gen1_dep32_batch16_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 8 + prefill_workers: 8 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx10_gen1_dep16_batch256_eplb256_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx10_gen1_dep16_batch256_eplb256_mtp0.yaml new file mode 100644 index 000000000..7cfee6b2e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx10_gen1_dep16_batch256_eplb256_mtp0.yaml @@ -0,0 +1,152 @@ +name: "ctx10_gen1_dep16_batch256_eplb256_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 10 + prefill_workers: 10 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + 
tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + num_slots: 256 + layer_updates_per_iter: 1 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..a7e491533 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0.yaml @@ -0,0 +1,127 @@ +name: "ctx1_gen4_tep8_batch16_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 
8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 6 + - 8 + - 9 + - 10 + - 11 + - 14 + - 15 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..fa6483998 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,118 @@ +name: "ctx1_gen4_tep8_batch1_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + 
prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0.yaml new file mode 100644 index 000000000..c0d6dc3f3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0.yaml @@ -0,0 +1,118 @@ +name: "ctx2_gen1_dep32_batch8_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 2 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 
32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 8 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx7_gen1_dep32_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx7_gen1_dep32_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..b78f93a10 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx7_gen1_dep32_batch32_eplb0_mtp0.yaml @@ -0,0 +1,121 @@ +name: "ctx7_gen1_dep32_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 7 + prefill_workers: 7 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + 
TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "60" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx8_gen1_dep16_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx8_gen1_dep16_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..080186d0f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx8_gen1_dep16_batch128_eplb0_mtp0.yaml @@ -0,0 +1,133 @@ +name: "ctx8_gen1_dep16_batch128_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 8 + prefill_workers: 8 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + 
tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1_1229.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1_1229.yaml new file mode 100644 index 000000000..6ea81b176 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1_1229.yaml @@ -0,0 +1,133 @@ +name: ctx1_gen1_dep16_batch64_eplb0_mtp1_1229 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + 
gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 64 + max_num_tokens: 128 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# 
/infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "16" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep32_batch16_eplb0_mtp3_615.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep32_batch16_eplb0_mtp3_615.yaml new file mode 100644 index 000000000..8e5f86356 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep32_batch16_eplb0_mtp3_615.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen1_dep32_batch16_eplb0_mtp3_615 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep8_batch256_eplb0_mtp1_2151.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep8_batch256_eplb0_mtp1_2151.yaml new file mode 100644 index 000000000..a96a862ef --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep8_batch256_eplb0_mtp1_2151.yaml @@ -0,0 +1,157 @@ +name: ctx1_gen1_dep8_batch256_eplb0_mtp1_2151 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 
+ max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep8_batch512_eplb0_mtp1_4301.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep8_batch512_eplb0_mtp1_4301.yaml new file mode 100644 index 000000000..449ca1d85 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep8_batch512_eplb0_mtp1_4301.yaml @@ -0,0 +1,189 @@ +name: ctx1_gen1_dep8_batch512_eplb0_mtp1_4301 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + 
max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 512 + max_num_tokens: 1024 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_9.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_9.yaml new file mode 100644 index 000000000..e6f72bd07 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_9.yaml @@ -0,0 +1,126 @@ +name: ctx1_gen3_tep8_batch2_eplb0_mtp3_9 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + max_num_tokens: 
16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 2 + max_num_tokens: 8 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_18.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_18.yaml new file mode 100644 index 000000000..519f5da0c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_18.yaml @@ -0,0 +1,126 @@ +name: ctx1_gen3_tep8_batch4_eplb0_mtp3_18 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + max_num_tokens: 
16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3_36.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3_36.yaml new file mode 100644 index 000000000..23c1180d5 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3_36.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen3_tep8_batch8_eplb0_mtp3_36 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + max_num_tokens: 
16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep16_batch128_eplb0_mtp0_2151.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep16_batch128_eplb0_mtp0_2151.yaml new file mode 100644 index 000000000..868c65032 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep16_batch128_eplb0_mtp0_2151.yaml @@ -0,0 +1,135 @@ +name: ctx1_gen1_dep16_batch128_eplb0_mtp0_2151 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 
16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "16" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep32_batch32_eplb0_mtp0_1127.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep32_batch32_eplb0_mtp0_1127.yaml new file mode 100644 index 000000000..64f1004f5 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep32_batch32_eplb0_mtp0_1127.yaml @@ -0,0 +1,123 @@ +name: ctx1_gen1_dep32_batch32_eplb0_mtp0_1127 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 
+ max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep32_batch8_eplb0_mtp0_256.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep32_batch8_eplb0_mtp0_256.yaml new file mode 100644 index 000000000..05f3d0763 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep32_batch8_eplb0_mtp0_256.yaml @@ -0,0 +1,120 @@ +name: ctx1_gen1_dep32_batch8_eplb0_mtp0_256 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + 
+resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 8 + max_num_tokens: 8 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4301.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4301.yaml new file mode 100644 index 000000000..5fcaf989c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4301.yaml @@ -0,0 +1,183 @@ +name: ctx1_gen1_dep8_batch512_eplb0_mtp0_4301 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 
+ max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 512 + max_num_tokens: 512 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch768_eplb0_mtp0_6144.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch768_eplb0_mtp0_6144.yaml new file mode 100644 index 000000000..5f54ed0f7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch768_eplb0_mtp0_6144.yaml @@ -0,0 +1,215 @@ +name: ctx1_gen1_dep8_batch768_eplb0_mtp0_6144 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + 
max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + - 520 + - 528 + - 536 + - 544 + - 552 + - 560 + - 568 + - 576 + - 584 + - 592 + - 600 + - 608 + - 616 + - 624 + - 632 + - 640 + - 648 + - 656 + - 664 + - 672 + - 680 + - 688 + - 696 + - 704 + - 712 + - 720 + - 728 + - 736 + - 744 + - 752 + - 760 + - 768 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 768 + max_num_tokens: 768 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_3.yaml new file mode 100644 index 000000000..801c5214a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_3.yaml @@ -0,0 +1,120 @@ +name: ctx1_gen3_tep8_batch1_eplb0_mtp0_3 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + max_num_tokens: 
16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen3_tep8_batch8_eplb0_mtp0_27.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen3_tep8_batch8_eplb0_mtp0_27.yaml new file mode 100644 index 000000000..9c57a2897 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen3_tep8_batch8_eplb0_mtp0_27.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen3_tep8_batch8_eplb0_mtp0_27 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + 
prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 8 + max_num_tokens: 8 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_6.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_6.yaml new file mode 100644 index 000000000..12632ffd1 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_6.yaml @@ -0,0 +1,126 @@ +name: ctx1_gen3_tep8_batch2_eplb0_mtp3_6 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + max_num_tokens: 
16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 2 + max_num_tokens: 8 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_15.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_15.yaml new file mode 100644 index 000000000..a80c790f9 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_15.yaml @@ -0,0 +1,126 @@ +name: ctx1_gen3_tep8_batch4_eplb0_mtp3_15 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + max_num_tokens: 
16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx2_gen1_dep32_batch2_eplb0_mtp3_90.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx2_gen1_dep32_batch2_eplb0_mtp3_90.yaml new file mode 100644 index 000000000..1f108d424 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx2_gen1_dep32_batch2_eplb0_mtp3_90.yaml @@ -0,0 +1,125 @@ +name: ctx2_gen1_dep32_batch2_eplb0_mtp3_90 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 4 + prefill_workers: 2 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + 
max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 2 + max_num_tokens: 8 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "32" + TOTAL_GPUS: "48" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx3_gen1_dep16_batch16_eplb0_mtp3_333.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx3_gen1_dep16_batch16_eplb0_mtp3_333.yaml new file mode 100644 index 000000000..08f63213f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx3_gen1_dep16_batch16_eplb0_mtp3_333.yaml @@ -0,0 +1,127 @@ +name: ctx3_gen1_dep16_batch16_eplb0_mtp3_333 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 6 + prefill_workers: 3 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + 
max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "16" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3_666.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3_666.yaml new file mode 100644 index 000000000..982765ae5 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3_666.yaml @@ -0,0 +1,133 @@ +name: ctx3_gen1_dep8_batch64_eplb0_mtp3_666 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 6 + prefill_workers: 3 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + 
max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 64 + max_num_tokens: 256 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx4_gen1_dep32_batch8_eplb0_mtp3_333.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx4_gen1_dep32_batch8_eplb0_mtp3_333.yaml new file mode 100644 index 000000000..6b286ce2e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx4_gen1_dep32_batch8_eplb0_mtp3_333.yaml @@ -0,0 +1,126 @@ +name: ctx4_gen1_dep32_batch8_eplb0_mtp3_333 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 8 + prefill_workers: 4 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + 
max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "32" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx5_gen1_dep16_batch32_eplb0_mtp3_666.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx5_gen1_dep16_batch32_eplb0_mtp3_666.yaml new file mode 100644 index 000000000..9bc424961 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx5_gen1_dep16_batch32_eplb0_mtp3_666.yaml @@ -0,0 +1,129 @@ +name: ctx5_gen1_dep16_batch32_eplb0_mtp3_666 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 10 + prefill_workers: 5 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + 
max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "16" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0_63.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0_63.yaml new file mode 100644 index 000000000..0430ce4b1 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0_63.yaml @@ -0,0 +1,122 @@ +name: ctx1_gen3_tep8_batch16_eplb0_mtp0_63 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + 
max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_6.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_6.yaml new file mode 100644 index 000000000..d1b526a07 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_6.yaml @@ -0,0 +1,120 @@ +name: ctx1_gen3_tep8_batch1_eplb0_mtp0_6 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + 
+resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch4_eplb0_mtp0_18.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch4_eplb0_mtp0_18.yaml new file mode 100644 index 000000000..fdf1e856c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch4_eplb0_mtp0_18.yaml @@ -0,0 +1,120 @@ +name: ctx1_gen3_tep8_batch4_eplb0_mtp0_18 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + max_num_tokens: 
16384
+      max_seq_len: 8232
+      moe_config:
+        backend: DEEPGEMM
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 8
+
+
+    decode:
+      allreduce_strategy: MNNVL
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config:
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        enable_padding: true
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+      max_batch_size: 4
+      max_num_tokens: 4
+      max_seq_len: 9256
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      moe_expert_parallel_size: 8
+      num_postprocess_workers: 4
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 100
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "32"
+
+frontend:
+  type: "dynamo"
+  nginx_container: "nginx-sqsh"
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0_333.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0_333.yaml
new file mode 100644
index 000000000..2dffe83f1
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0_333.yaml
@@ -0,0 +1,120 @@
+name: ctx2_gen1_dep32_batch8_eplb0_mtp0_333
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 4
+  prefill_workers: 2
+  gpus_per_prefill: 8
+
+  decode_workers: 1
+  decode_nodes: 8
+  gpus_per_decode: 32
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    TRTLLM_ENABLE_PDL: "1"
+    ENROOT_ALLOW_DEV: "yes"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    TRTLLM_ENABLE_PDL: "1"
+    ENROOT_ALLOW_DEV: "yes"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED"
+    ENABLE_CONFIGURABLE_MOE: "1"
+
+  trtllm_config:
+    prefill:
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.4
+      max_batch_size: 2
+      max_num_tokens: 16384
+      max_seq_len: 8232
+      moe_config:
+        backend: DEEPGEMM
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 8
+
+
+    decode:
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config:
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+        enable_padding: true
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.75
+      max_batch_size: 8
+      max_num_tokens: 8
+      max_seq_len: 9256
+      moe_config:
+        backend: DEEPGEMM
+        use_low_precision_moe_combine: true
+      moe_expert_parallel_size: 32
+      num_postprocess_workers: 4
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 100
+      tensor_parallel_size: 32
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "32"
+    TOTAL_GPUS: "48"
+
+frontend:
+  type: "dynamo"
+  nginx_container: "nginx-sqsh"
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx3_gen1_dep16_batch32_eplb0_mtp0_615.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx3_gen1_dep16_batch32_eplb0_mtp0_615.yaml
new file mode 100644
index 000000000..ba7c6142f
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx3_gen1_dep16_batch32_eplb0_mtp0_615.yaml
@@ -0,0 +1,123 @@
+name: ctx3_gen1_dep16_batch32_eplb0_mtp0_615
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 6
+  prefill_workers: 3
+  gpus_per_prefill: 8
+
+  decode_workers: 1
+  decode_nodes: 4
+  gpus_per_decode: 16
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    TRTLLM_ENABLE_PDL: "1"
+    ENROOT_ALLOW_DEV: "yes"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    TRTLLM_ENABLE_PDL: "1"
+    ENROOT_ALLOW_DEV: "yes"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED"
+    ENABLE_CONFIGURABLE_MOE: "1"
+
+  trtllm_config:
+    prefill:
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.4
+      max_batch_size: 2
+      max_num_tokens: 16384
+      max_seq_len: 8232
+      moe_config:
+        backend: DEEPGEMM
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 8
+
+
+    decode:
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config:
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+        - 16
+        - 24
+        - 32
+        enable_padding: true
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 32
+      max_num_tokens: 32
+      max_seq_len: 9256
+      moe_config:
+        backend: DEEPGEMM
+        use_low_precision_moe_combine: true
+      moe_expert_parallel_size: 16
+      num_postprocess_workers: 4
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 100
+      tensor_parallel_size: 16
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "16"
+    TOTAL_GPUS: "40"
+
+frontend:
+  type: "dynamo"
+  nginx_container: "nginx-sqsh"
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx4_gen1_dep32_batch16_eplb0_mtp0_666.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx4_gen1_dep32_batch16_eplb0_mtp0_666.yaml
new file mode 100644
index 000000000..8675bf58d
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx4_gen1_dep32_batch16_eplb0_mtp0_666.yaml
@@ -0,0 +1,121 @@
+name: ctx4_gen1_dep32_batch16_eplb0_mtp0_666
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 8
+  prefill_workers: 4
+  gpus_per_prefill: 8
+
+  decode_workers: 1
+  decode_nodes: 8
+  gpus_per_decode: 32
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    TRTLLM_ENABLE_PDL: "1"
+    ENROOT_ALLOW_DEV: "yes"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    TRTLLM_ENABLE_PDL: "1"
+    ENROOT_ALLOW_DEV: "yes"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED"
+    ENABLE_CONFIGURABLE_MOE: "1"
+
+  trtllm_config:
+    prefill:
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.4
+      max_batch_size: 2
+      max_num_tokens: 16384
+      max_seq_len: 8232
+      moe_config:
+        backend: DEEPGEMM
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 8
+
+
+    decode:
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config:
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+        - 16
+        enable_padding: true
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.75
+      max_batch_size: 16
+      max_num_tokens: 16
+      max_seq_len: 9256
+      moe_config:
+        backend: DEEPGEMM
+        use_low_precision_moe_combine: true
+      moe_expert_parallel_size: 32
+      num_postprocess_workers: 4
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 100
+      tensor_parallel_size: 32
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "32"
+    TOTAL_GPUS: "64"
+
+frontend:
+  type: "dynamo"
+  nginx_container: "nginx-sqsh"
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx5_gen1_dep16_batch64_eplb0_mtp0_1229.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx5_gen1_dep16_batch64_eplb0_mtp0_1229.yaml
new file mode 100644
index 000000000..ca9b432d0
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx5_gen1_dep16_batch64_eplb0_mtp0_1229.yaml
@@ -0,0 +1,127 @@
+name: ctx5_gen1_dep16_batch64_eplb0_mtp0_1229
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 10
+  prefill_workers: 5
+  gpus_per_prefill: 8
+
+  decode_workers: 1
+  decode_nodes: 4
+  gpus_per_decode: 16
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    TRTLLM_ENABLE_PDL: "1"
+    ENROOT_ALLOW_DEV: "yes"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    TRTLLM_ENABLE_PDL: "1"
+    ENROOT_ALLOW_DEV: "yes"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED"
+    ENABLE_CONFIGURABLE_MOE: "1"
+
+  trtllm_config:
+    prefill:
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.4
+      max_batch_size: 2
+      max_num_tokens: 16384
+      max_seq_len: 8232
+      moe_config:
+        backend: DEEPGEMM
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 8
+
+
+    decode:
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config:
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+        - 16
+        - 24
+        - 32
+        - 40
+        - 48
+        - 56
+        - 64
+        enable_padding: true
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 64
+      max_num_tokens: 64
+      max_seq_len: 9256
+      moe_config:
+        backend: DEEPGEMM
+        use_low_precision_moe_combine: true
+      moe_expert_parallel_size: 16
+      num_postprocess_workers: 4
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 100
+      tensor_parallel_size: 16
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "16"
+    TOTAL_GPUS: "56"
+
+frontend:
+  type: "dynamo"
+  nginx_container: "nginx-sqsh"
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep32_batch8_eplb0_mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep32_batch8_eplb0_mtp.yaml
new file mode 100644
index 000000000..b3d1dd62a
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep32_batch8_eplb0_mtp.yaml
@@ -0,0 +1,127 @@
+name: "ctx1_gen1_dep32_batch8_eplb0_mtp"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 1
+  decode_nodes: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      tensor_parallel_size: 32
+      moe_expert_parallel_size: 32
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      pipeline_parallel_size: 1
+      max_batch_size: 8
+      max_num_tokens: 32
+      max_seq_len: 2088
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+      print_iter_log: true
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "2"
+    DECODE_GPUS: "32"
+    TOTAL_GPUS: "34"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
+
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep4_batch768_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep4_batch768_eplb0_mtp1.yaml
new file mode 100644
index 000000000..2b9d42408
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep4_batch768_eplb0_mtp1.yaml
@@ -0,0 +1,222 @@
+name: "ctx1_gen1_dep4_batch768_eplb0_mtp1"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 1
+  decode_nodes: 1
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 1
+
+    decode:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      pipeline_parallel_size: 1
+      max_batch_size: 768
+      max_num_tokens: 1536
+      max_seq_len: 2088
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+        - 16
+        - 24
+        - 32
+        - 40
+        - 48
+        - 56
+        - 64
+        - 72
+        - 80
+        - 88
+        - 96
+        - 104
+        - 112
+        - 120
+        - 128
+        - 136
+        - 144
+        - 152
+        - 160
+        - 168
+        - 176
+        - 184
+        - 192
+        - 200
+        - 208
+        - 216
+        - 224
+        - 232
+        - 240
+        - 248
+        - 256
+        - 264
+        - 272
+        - 280
+        - 288
+        - 296
+        - 304
+        - 312
+        - 320
+        - 328
+        - 336
+        - 344
+        - 352
+        - 360
+        - 368
+        - 376
+        - 384
+        - 392
+        - 400
+        - 408
+        - 416
+        - 424
+        - 432
+        - 440
+        - 448
+        - 456
+        - 464
+        - 472
+        - 480
+        - 488
+        - 496
+        - 504
+        - 512
+        - 520
+        - 528
+        - 536
+        - 544
+        - 552
+        - 560
+        - 568
+        - 576
+        - 584
+        - 592
+        - 600
+        - 608
+        - 616
+        - 624
+        - 632
+        - 640
+        - 648
+        - 656
+        - 664
+        - 672
+        - 680
+        - 688
+        - 696
+        - 704
+        - 712
+        - 720
+        - 728
+        - 736
+        - 744
+        - 752
+        - 760
+        - 768
+      print_iter_log: true
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+        dtype: fp8
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 1
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "2"
+    DECODE_GPUS: "4"
+    TOTAL_GPUS: "6"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
+
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml
new file mode 100644
index 000000000..c2c4c537a
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml
@@ -0,0 +1,125 @@
+name: "ctx1_gen4_tep8_batch1_eplb0_mtp3"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 4
+  decode_nodes: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      pipeline_parallel_size: 1
+      max_batch_size: 1
+      max_num_tokens: 4
+      max_seq_len: 2088
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+        - 1
+      print_iter_log: true
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      stream_interval: 100
+      num_postprocess_workers: 4
+      allreduce_strategy: MNNVL
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "2"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "34"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
+
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml
new file mode 100644
index 000000000..da70d4074
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml
@@ -0,0 +1,130 @@
+name: "ctx1_gen4_tep8_batch8_eplb0_mtp3"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 4
+  decode_nodes: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      pipeline_parallel_size: 1
+      max_batch_size: 8
+      max_num_tokens: 32
+      max_seq_len: 2088
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+        - 1
+        - 2
+        - 3
+        - 4
+        - 6
+        - 8
+      print_iter_log: true
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      stream_interval: 100
+      num_postprocess_workers: 4
+      allreduce_strategy: MNNVL
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "2"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "34"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
+
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx3_gen1_dep16_batch128_eplb256_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx3_gen1_dep16_batch128_eplb256_mtp1.yaml
new file mode 100644
index 000000000..12174174c
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx3_gen1_dep16_batch128_eplb256_mtp1.yaml
@@ -0,0 +1,145 @@
+name: "ctx3_gen1_dep16_batch128_eplb256_mtp1"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 2
+  prefill_workers: 3
+  gpus_per_prefill: 2
+
+  decode_workers: 1
+  decode_nodes: 4
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 1
+
+    decode:
+      tensor_parallel_size: 16
+      moe_expert_parallel_size: 16
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      pipeline_parallel_size: 1
+      max_batch_size: 128
+      max_num_tokens: 256
+      max_seq_len: 2088
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+        - 16
+        - 24
+        - 32
+        - 40
+        - 48
+        - 56
+        - 64
+        - 72
+        - 80
+        - 88
+        - 96
+        - 104
+        - 112
+        - 120
+        - 128
+      print_iter_log: true
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.7
+        dtype: fp8
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+        load_balancer:
+          num_slots: 256
+          layer_updates_per_iter: 1
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 1
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "2"
+    DECODE_GPUS: "16"
+    TOTAL_GPUS: "22"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
+
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx3_gen1_dep32_batch32_eplb288_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx3_gen1_dep32_batch32_eplb288_mtp3.yaml
new file mode 100644
index 000000000..502ae7cf2
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx3_gen1_dep32_batch32_eplb288_mtp3.yaml
@@ -0,0 +1,133 @@
+name: "ctx3_gen1_dep32_batch32_eplb288_mtp3"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 2
+  prefill_workers: 3
+  gpus_per_prefill: 2
+
+  decode_workers: 1
+  decode_nodes: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      tensor_parallel_size: 32
+      moe_expert_parallel_size: 32
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      pipeline_parallel_size: 1
+      max_batch_size: 32
+      max_num_tokens: 128
+      max_seq_len: 2088
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+        - 16
+        - 24
+        - 32
+      print_iter_log: true
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+        load_balancer:
+          num_slots: 288
+          layer_updates_per_iter: 1
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "2"
+    DECODE_GPUS: "32"
+    TOTAL_GPUS: "38"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
+
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml
new file mode 100644
index 000000000..cba8a4f64
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml
@@ -0,0 +1,119 @@
+name: "ctx1_gen4_tep8_batch1_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 4
+  decode_nodes: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+
+    decode:
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      pipeline_parallel_size: 1
+      max_batch_size: 1
+      max_num_tokens: 1
+      max_seq_len: 2088
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+        - 1
+      print_iter_log: true
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      stream_interval: 100
+      num_postprocess_workers: 4
+      allreduce_strategy: MNNVL
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "2"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "34"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
+
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml
new file mode 100644
index 000000000..794556055
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,133 @@
+name: "ctx1_gen4_tep8_batch32_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 4
+  decode_nodes: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+
+    decode:
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      pipeline_parallel_size: 1
+      max_batch_size: 32
+      max_num_tokens: 32
+      max_seq_len: 2088
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+        - 1
+        - 2
+        - 3
+        - 4
+        - 6
+        - 8
+        - 10
+        - 11
+        - 12
+        - 16
+        - 18
+        - 20
+        - 24
+        - 28
+        - 32
+      print_iter_log: true
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      stream_interval: 100
+      num_postprocess_workers: 4
+      allreduce_strategy: MNNVL
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "2"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "34"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
+
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0.yaml
new file mode 100644
index 000000000..8249a5369
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,123 @@
+name: "ctx2_gen1_dep32_batch32_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 1
+  prefill_workers: 2
+
+  decode_workers: 1
+  decode_nodes: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+
max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "32" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx2_gen1_dep8_batch1024_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx2_gen1_dep8_batch1024_eplb0_mtp0.yaml new file mode 100644 index 000000000..5f96315ff --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx2_gen1_dep8_batch1024_eplb0_mtp0.yaml @@ -0,0 +1,247 @@ +name: "ctx2_gen1_dep8_batch1024_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 2 + + decode_workers: 1 + decode_nodes: 2 + + 
gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1024 + max_num_tokens: 1024 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + - 520 + - 528 + - 536 + - 544 + - 552 + - 560 + - 568 + - 576 + - 584 + - 592 + - 600 + - 608 + - 616 + - 624 + - 632 + - 640 + - 648 + - 656 + - 664 + - 672 + - 680 + - 688 + - 696 + - 704 + - 712 + - 720 + - 728 + - 
736 + - 744 + - 752 + - 760 + - 768 + - 776 + - 784 + - 792 + - 800 + - 808 + - 816 + - 824 + - 832 + - 840 + - 848 + - 856 + - 864 + - 872 + - 880 + - 888 + - 896 + - 904 + - 912 + - 920 + - 928 + - 936 + - 944 + - 952 + - 960 + - 968 + - 976 + - 984 + - 992 + - 1000 + - 1008 + - 1016 + - 1024 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx3_gen1_dep16_batch256_eplb256_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx3_gen1_dep16_batch256_eplb256_mtp0.yaml new file mode 100644 index 000000000..50f4f8f0f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx3_gen1_dep16_batch256_eplb256_mtp0.yaml @@ -0,0 +1,155 @@ +name: "ctx3_gen1_dep16_batch256_eplb256_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + 
prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + num_slots: 256 + layer_updates_per_iter: 1 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the 
bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "16" + TOTAL_GPUS: "22" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx3_gen1_dep32_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx3_gen1_dep32_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..9acddc31e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx3_gen1_dep32_batch64_eplb0_mtp0.yaml @@ -0,0 +1,128 @@ +name: "ctx3_gen1_dep32_batch64_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + 
kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "32" + TOTAL_GPUS: "38" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep16_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep16_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..4d258c289 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep16_batch32_eplb0_mtp3.yaml @@ -0,0 +1,129 @@ +name: "ctx10_gen1_dep16_batch32_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 5 + prefill_workers: 10 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + 
max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "16" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml new file mode 100644 index 000000000..c10a8598b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml @@ -0,0 +1,157 @@ +name: "ctx10_gen1_dep8_batch256_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 5 + prefill_workers: 10 + + decode_workers: 1 + decode_nodes: 2 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + 
max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx13_gen1_dep16_batch64_eplb256_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx13_gen1_dep16_batch64_eplb256_mtp3.yaml new file mode 100644 index 000000000..df0375f0e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx13_gen1_dep16_batch64_eplb256_mtp3.yaml @@ -0,0 +1,137 @@ +name: "ctx13_gen1_dep16_batch64_eplb256_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 7 + prefill_workers: 13 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + num_slots: 256 + layer_updates_per_iter: 1 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "16" + TOTAL_GPUS: "42" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..6ce834ce3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3.yaml @@ -0,0 +1,128 @@ +name: "ctx1_gen3_tep8_batch8_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 3 + decode_nodes: 6 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "26" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml new file mode 100644 index 000000000..53771a342 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml @@ -0,0 +1,125 @@ +name: "ctx1_gen4_tep8_batch1_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml new file mode 100644 index 000000000..b2349f421 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml @@ -0,0 +1,128 @@ +name: "ctx1_gen4_tep8_batch4_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx4_gen1_dep32_batch4_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx4_gen1_dep32_batch4_eplb0_mtp3.yaml new file mode 100644 index 000000000..ddd5641a9 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx4_gen1_dep32_batch4_eplb0_mtp3.yaml @@ -0,0 +1,125 @@ +name: "ctx4_gen1_dep32_batch4_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 2 + prefill_workers: 4 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + 
max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx8_gen1_dep32_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx8_gen1_dep32_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..aaca79561 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx8_gen1_dep32_batch8_eplb0_mtp3.yaml @@ -0,0 +1,126 @@ +name: "ctx8_gen1_dep32_batch8_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 4 + prefill_workers: 8 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + 
max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "32" + TOTAL_GPUS: "48" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx11_gen3_dep4_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx11_gen3_dep4_batch256_eplb0_mtp0.yaml new file mode 100644 index 000000000..f141a5005 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx11_gen3_dep4_batch256_eplb0_mtp0.yaml @@ -0,0 +1,152 @@ +name: "ctx11_gen3_dep4_batch256_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 6 + prefill_workers: 11 + gpus_per_prefill: 2 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "4" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx14_gen1_dep16_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx14_gen1_dep16_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..882083834 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx14_gen1_dep16_batch128_eplb0_mtp0.yaml @@ -0,0 +1,135 @@ +name: "ctx14_gen1_dep16_batch128_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 7 + prefill_workers: 14 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + 
max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "16" + TOTAL_GPUS: "44" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..e4568f7e1 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml @@ -0,0 +1,123 @@ +name: "ctx1_gen3_tep8_batch16_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 3 + decode_nodes: 6 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "26" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..5a6e21737 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,119 @@ +name: "ctx1_gen4_tep8_batch1_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + 
decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch2_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch2_eplb0_mtp0.yaml new file mode 100644 index 000000000..4b8ad5a43 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch2_eplb0_mtp0.yaml @@ -0,0 +1,120 @@ +name: "ctx1_gen4_tep8_batch2_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 2 + max_num_tokens: 2 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml new file mode 100644 index 000000000..6f6194a84 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml @@ -0,0 +1,120 @@ +name: "ctx1_gen5_tep4_batch4_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 5 + 
decode_nodes: 5 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 4 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "4" + TOTAL_GPUS: "22" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx7_gen1_dep32_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx7_gen1_dep32_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..f68b83534 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx7_gen1_dep32_batch16_eplb0_mtp0.yaml @@ -0,0 +1,122 @@ +name: "ctx7_gen1_dep32_batch16_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 4 + prefill_workers: 7 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "32" + TOTAL_GPUS: "46" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx9_gen1_dep16_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx9_gen1_dep16_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..db6ae1b3f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx9_gen1_dep16_batch64_eplb0_mtp0.yaml @@ -0,0 +1,128 @@ +name: "ctx9_gen1_dep16_batch64_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 5 + prefill_workers: 9 + gpus_per_prefill: 2 + + decode_workers: 1 
+ decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "16" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp3_666.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp3_666.yaml new file mode 100644 index 000000000..f03320ce7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp3_666.yaml @@ -0,0 +1,132 @@ +name: ctx1_gen1_dep16_batch32_eplb0_mtp3_666 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 
16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "20" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3_180.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3_180.yaml new file mode 100644 index 000000000..3783dd563 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3_180.yaml @@ -0,0 +1,128 @@ +name: ctx1_gen1_dep32_batch4_eplb0_mtp3_180 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml new file mode 100644 index 000000000..d4cf77025 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml @@ -0,0 +1,129 @@ +name: ctx1_gen4_tep8_batch1_eplb0_mtp3_8 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml new file mode 100644 index 000000000..e6d895550 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml @@ -0,0 +1,129 @@ +name: ctx1_gen4_tep8_batch4_eplb0_mtp3_24 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx2_gen1_dep16_batch128_eplb0_mtp1_2253.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx2_gen1_dep16_batch128_eplb0_mtp1_2253.yaml new file mode 100644 index 000000000..f178dc30a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx2_gen1_dep16_batch128_eplb0_mtp1_2253.yaml @@ -0,0 +1,144 @@ +name: ctx2_gen1_dep16_batch128_eplb0_mtp1_2253 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 128 + max_num_tokens: 256 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx2_gen1_dep32_batch16_eplb0_mtp3_564.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx2_gen1_dep32_batch16_eplb0_mtp3_564.yaml new file mode 100644 index 000000000..562ada512 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx2_gen1_dep32_batch16_eplb0_mtp3_564.yaml @@ -0,0 +1,130 @@ +name: ctx2_gen1_dep32_batch16_eplb0_mtp3_564 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx3_gen2_dep8_batch512_eplb0_mtp1_8192.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx3_gen2_dep8_batch512_eplb0_mtp1_8192.yaml new file mode 100644 index 000000000..87ba559b2 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx3_gen2_dep8_batch512_eplb0_mtp1_8192.yaml @@ -0,0 +1,192 @@ +name: ctx3_gen2_dep8_batch512_eplb0_mtp1_8192 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 3 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 512 + max_num_tokens: 1024 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0_84.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0_84.yaml new file mode 100644 index 000000000..57803a156 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0_84.yaml @@ -0,0 +1,125 @@ +name: ctx1_gen4_tep8_batch16_eplb0_mtp0_84 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: 
false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml new file mode 100644 index 000000000..3f3905468 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml @@ -0,0 +1,123 @@ +name: ctx1_gen4_tep8_batch1_eplb0_mtp0_4 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml new file mode 100644 index 000000000..6e2ba5e8e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml @@ -0,0 +1,123 @@ +name: ctx1_gen4_tep8_batch4_eplb0_mtp0_24 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 4 + max_num_tokens: 4 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx2_gen1_dep16_batch128_eplb0_mtp0_2253.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx2_gen1_dep16_batch128_eplb0_mtp0_2253.yaml new file mode 100644 index 000000000..2580bab99 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx2_gen1_dep16_batch128_eplb0_mtp0_2253.yaml @@ -0,0 +1,138 @@ +name: ctx2_gen1_dep16_batch128_eplb0_mtp0_2253 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0_1229.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0_1229.yaml new file mode 100644 index 000000000..c7dc2dcdd --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0_1229.yaml @@ -0,0 +1,126 @@ +name: ctx2_gen1_dep32_batch32_eplb0_mtp0_1229 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx3_gen2_dep8_batch512_eplb0_mtp0_8602.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx3_gen2_dep8_batch512_eplb0_mtp0_8602.yaml new file mode 100644 index 000000000..c4613dbb2 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx3_gen2_dep8_batch512_eplb0_mtp0_8602.yaml @@ -0,0 +1,186 @@ +name: ctx3_gen2_dep8_batch512_eplb0_mtp0_8602 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 3 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 512 + max_num_tokens: 512 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx3_gen2_dep8_batch768_eplb0_mtp0_12288.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx3_gen2_dep8_batch768_eplb0_mtp0_12288.yaml new file mode 100644 index 000000000..bdc07bf9d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx3_gen2_dep8_batch768_eplb0_mtp0_12288.yaml @@ -0,0 +1,218 @@ +name: ctx3_gen2_dep8_batch768_eplb0_mtp0_12288 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 3 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + - 520 + - 528 + - 536 + - 544 + - 552 + - 560 + - 568 + - 576 + - 584 + - 592 + - 600 + - 608 + - 616 + - 624 + - 632 + - 640 + - 648 + - 656 + - 664 + - 672 + - 680 + - 688 + - 696 + - 704 + - 712 + - 720 + - 728 + - 736 + - 744 + - 752 + - 760 + - 768 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 768 + max_num_tokens: 768 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx10_gen1_dep16_batch64_eplb0_mtp1_1229.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx10_gen1_dep16_batch64_eplb0_mtp1_1229.yaml new file mode 100644 index 000000000..95a1bd02e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx10_gen1_dep16_batch64_eplb0_mtp1_1229.yaml @@ -0,0 +1,136 @@ +name: ctx10_gen1_dep16_batch64_eplb0_mtp1_1229 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 10 + prefill_workers: 10 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 64 + max_num_tokens: 128 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml new file mode 100644 index 000000000..644b5a20b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml @@ -0,0 +1,129 @@ +name: ctx1_gen4_tep8_batch1_eplb0_mtp3_8 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml new file mode 100644 index 000000000..5c7a8ed5c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml @@ -0,0 +1,129 @@ +name: ctx1_gen4_tep8_batch4_eplb0_mtp3_24 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx6_gen1_dep32_batch8_eplb0_mtp3_333.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx6_gen1_dep32_batch8_eplb0_mtp3_333.yaml new file mode 100644 index 000000000..c78705873 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx6_gen1_dep32_batch8_eplb0_mtp3_333.yaml @@ -0,0 +1,129 @@ +name: ctx6_gen1_dep32_batch8_eplb0_mtp3_333 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 6 + prefill_workers: 6 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx7_gen1_dep8_batch128_eplb0_mtp1_1229.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx7_gen1_dep8_batch128_eplb0_mtp1_1229.yaml new file mode 100644 index 000000000..e00287de7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx7_gen1_dep8_batch128_eplb0_mtp1_1229.yaml @@ -0,0 +1,144 @@ +name: ctx7_gen1_dep8_batch128_eplb0_mtp1_1229 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 7 + prefill_workers: 7 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 128 + max_num_tokens: 256 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx8_gen1_dep16_batch32_eplb0_mtp3_666.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx8_gen1_dep16_batch32_eplb0_mtp3_666.yaml new file mode 100644 index 000000000..162f003e4 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx8_gen1_dep16_batch32_eplb0_mtp3_666.yaml @@ -0,0 +1,132 @@ +name: ctx8_gen1_dep16_batch32_eplb0_mtp3_666 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 8 + prefill_workers: 8 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml new file mode 100644 index 000000000..3a470113e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml @@ -0,0 +1,123 @@ +name: ctx1_gen4_tep8_batch1_eplb0_mtp0_4 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml new file mode 100644 index 000000000..8b14ffd93 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml @@ -0,0 +1,123 @@ +name: ctx1_gen4_tep8_batch4_eplb0_mtp0_24 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 4 + max_num_tokens: 4 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch8_eplb0_mtp0_36.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch8_eplb0_mtp0_36.yaml new file mode 100644 index 000000000..f5994c054 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch8_eplb0_mtp0_36.yaml @@ -0,0 +1,124 @@ +name: ctx1_gen4_tep8_batch8_eplb0_mtp0_36 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 8 + max_num_tokens: 8 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx4_gen1_dep16_batch32_eplb0_mtp0_666.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx4_gen1_dep16_batch32_eplb0_mtp0_666.yaml new file mode 100644 index 000000000..fcf7292da --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx4_gen1_dep16_batch32_eplb0_mtp0_666.yaml @@ -0,0 +1,126 @@ +name: ctx4_gen1_dep16_batch32_eplb0_mtp0_666 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 4 + prefill_workers: 4 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx6_gen1_dep32_batch16_eplb0_mtp0_512.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx6_gen1_dep32_batch16_eplb0_mtp0_512.yaml new file mode 100644 index 000000000..ac8d6faa6 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx6_gen1_dep32_batch16_eplb0_mtp0_512.yaml @@ -0,0 +1,124 @@ +name: ctx6_gen1_dep32_batch16_eplb0_mtp0_512 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 6 + prefill_workers: 6 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx7_gen1_dep16_batch64_eplb0_mtp0_1229.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx7_gen1_dep16_batch64_eplb0_mtp0_1229.yaml new file mode 100644 index 000000000..e585cc065 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx7_gen1_dep16_batch64_eplb0_mtp0_1229.yaml @@ -0,0 +1,130 @@ +name: ctx7_gen1_dep16_batch64_eplb0_mtp0_1229 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 7 + prefill_workers: 7 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "44" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx7_gen1_dep8_batch256_eplb0_mtp0_2151.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx7_gen1_dep8_batch256_eplb0_mtp0_2151.yaml new file mode 100644 index 000000000..87272ba14 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx7_gen1_dep8_batch256_eplb0_mtp0_2151.yaml @@ -0,0 +1,154 @@ +name: ctx7_gen1_dep8_batch256_eplb0_mtp0_2151 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 7 + prefill_workers: 7 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp2.yaml new file mode 100644 index 000000000..67da71d3d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp2.yaml @@ -0,0 +1,111 @@ +name: h100_1k1k_ctx1dep16_gen1dep16_batch32_eplb0_mtp2_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 1 + decode_nodes: 2 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + decode_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + 
moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1.yaml new file mode 100644 index 000000000..766d7fd79 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1.yaml @@ -0,0 +1,115 @@ +name: h100_1k1k_ctx1dep16_gen1dep16_batch64_eplb0_mtp1_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 1 + decode_nodes: 2 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + decode_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_dep16_batch4_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_dep16_batch4_eplb0_mtp3.yaml new file mode 100644 index 000000000..d2e17ac7a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_dep16_batch4_eplb0_mtp3.yaml @@ -0,0 +1,107 @@ +name: h100_1k1k_ctx1dep16_gen3dep16_batch4_eplb0_mtp3_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + decode_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch128_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch128_eplb0_mtp3.yaml new file mode 100644 index 000000000..a48f9c94a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch128_eplb0_mtp3.yaml @@ -0,0 +1,120 @@ +name: h100_1k1k_ctx1dep16_gen3tep16_batch128_eplb0_mtp3_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch16_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch16_eplb0_mtp3.yaml new file mode 100644 index 000000000..c07b82fad --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch16_eplb0_mtp3.yaml @@ -0,0 +1,106 @@ +name: h100_1k1k_ctx1dep16_gen3tep16_batch16_eplb0_mtp3_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml new file mode 100644 index 000000000..d64e9777c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml @@ -0,0 +1,104 @@ +name: h100_1k1k_ctx1dep16_gen3tep16_batch1_eplb0_mtp3_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml new file mode 100644 index 000000000..077357b39 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml @@ -0,0 +1,104 @@ +name: h100_1k1k_ctx1dep16_gen3tep16_batch2_eplb0_mtp3_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 2 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..414388c6b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch32_eplb0_mtp3.yaml @@ -0,0 +1,108 @@ +name: h100_1k1k_ctx1dep16_gen3tep16_batch32_eplb0_mtp3_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..d49f37947 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml @@ -0,0 +1,105 @@ +name: h100_1k1k_ctx1dep16_gen3tep16_batch8_eplb0_mtp3_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..1624bcc3e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch16_eplb0_mtp0.yaml @@ -0,0 +1,103 @@ +name: ctx1dep16_gen3dep16_batch16_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + decode_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null 
+ disable_overlap_scheduler: true + enable_chunked_prefill: true + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..f632508e1 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch32_eplb0_mtp0.yaml @@ -0,0 +1,105 @@ +name: ctx1dep16_gen3dep16_batch32_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + decode_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null 
+ disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch4_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch4_eplb0_mtp0.yaml new file mode 100644 index 000000000..6cd4b7697 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch4_eplb0_mtp0.yaml @@ -0,0 +1,101 @@ +name: ctx1dep16_gen3dep16_batch4_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + decode_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + 
disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch8_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch8_eplb0_mtp0.yaml new file mode 100644 index 000000000..10ab482b3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch8_eplb0_mtp0.yaml @@ -0,0 +1,102 @@ +name: ctx1dep16_gen3dep16_batch8_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + decode_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + 
disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..850acc0da --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch16_eplb0_mtp0.yaml @@ -0,0 +1,100 @@ +name: ctx1dep16_gen3tep16_batch16_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..a1d5c9aac --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml @@ -0,0 +1,98 @@ +name: ctx1dep16_gen3tep16_batch1_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: 
+ UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: true + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml new file mode 100644 index 000000000..c3b1144bd --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml @@ -0,0 +1,98 @@ +name: ctx1dep16_gen3tep16_batch2_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: true + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 2 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml new file mode 100644 index 000000000..2e972e14b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml @@ -0,0 +1,99 @@ +name: ctx1dep16_gen3tep16_batch8_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + 
UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: true + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx2_gen1_dep16_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx2_gen1_dep16_batch256_eplb0_mtp0.yaml new file mode 100644 index 000000000..3dd8f5482 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx2_gen1_dep16_batch256_eplb0_mtp0.yaml @@ -0,0 +1,133 @@ +name: ctx2dep16_gen1dep16_batch256_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 2 + prefill_nodes: 4 + decode_workers: 1 + decode_nodes: 2 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + decode_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: 
null + disable_overlap_scheduler: true + enable_chunked_prefill: true + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen1_dep16_batch4_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen1_dep16_batch4_eplb0_mtp3.yaml new file mode 100644 index 000000000..007d7e4eb --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen1_dep16_batch4_eplb0_mtp3.yaml @@ -0,0 +1,107 @@ +name: h100_8k1k_ctx1dep16_gen1dep16_batch4_eplb0_mtp3 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 1 + decode_nodes: 2 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + 
enable_chunked_prefill: true + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen2_tep16_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen2_tep16_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..ecf82c12b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen2_tep16_batch32_eplb0_mtp3.yaml @@ -0,0 +1,109 @@ +name: h100_8k1k_ctx1dep16_gen2tep16_batch32_eplb0_mtp3 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 2 + decode_nodes: 4 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + 
kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml new file mode 100644 index 000000000..221dfc3f7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml @@ -0,0 +1,105 @@ +name: h100_8k1k_ctx1dep16_gen3tep16_batch1_eplb0_mtp3 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: true + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml new file mode 100644 index 000000000..3b6a18fe6 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml @@ -0,0 +1,105 @@ +name: h100_8k1k_ctx1dep16_gen3tep16_batch2_eplb0_mtp3 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: true + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 2 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..baf2c1e0d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml @@ -0,0 +1,106 @@ +name: h100_8k1k_ctx1dep16_gen3tep16_batch8_eplb0_mtp3 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx2_gen1_dep16_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx2_gen1_dep16_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..8be542e76 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx2_gen1_dep16_batch8_eplb0_mtp3.yaml @@ -0,0 +1,108 @@ +name: h100_8k1k_ctx2dep16_gen1dep16_batch8_eplb0_mtp3 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 2 + prefill_nodes: 4 + decode_workers: 1 + decode_nodes: 2 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + 
enable_chunked_prefill: true + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen2_tep16_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen2_tep16_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..0bf877f96 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen2_tep16_batch64_eplb0_mtp0.yaml @@ -0,0 +1,110 @@ + + +name: "h100_8k1k_ctx1dep16_gen2tep16_batch64_eplb0_mtp0" + +model: + path: "DeepSeek-R1-0528" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 2 + decode_nodes: 4 + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: "1" + TRTLLM_FORCE_ALLTOALL_METHOD: "DeepEP" + + decode_environment: + NCCL_NVLS_ENABLE: "0" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + 
max_num_tokens: 16384 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64] + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # Multiple frontends collide on port 8080 (and other ports). 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..b68e4f1a5 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml @@ -0,0 +1,100 @@ +name: h100_8k1k_ctx1dep16_gen3tep16_batch1_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + 
enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml new file mode 100644 index 000000000..06b713a32 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml @@ -0,0 +1,110 @@ + + +name: "h100_8k1k_ctx1dep16_gen3tep16_batch2_eplb0_mtp0" + +model: + path: "DeepSeek-R1-0528" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + 
TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: "1" + TRTLLM_FORCE_ALLTOALL_METHOD: "DeepEP" + + decode_environment: + NCCL_NVLS_ENABLE: "0" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 2 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4] + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # There are errors about colliding on port 8080, and others. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml new file mode 100644 index 000000000..030c98654 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml @@ -0,0 +1,110 @@ + + +name: "h100_8k1k_ctx1dep16_gen3tep16_batch8_eplb0_mtp0" + +model: + path: "DeepSeek-R1-0528" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: "1" + TRTLLM_FORCE_ALLTOALL_METHOD: "DeepEP" + + decode_environment: + NCCL_NVLS_ENABLE: "0" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: true + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + 
enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8] + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # There are errors about colliding on port 8080, and others. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx2_gen1_dep16_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx2_gen1_dep16_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..1f882bc75 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx2_gen1_dep16_batch16_eplb0_mtp0.yaml @@ -0,0 +1,103 @@ +name: h100_8k1k_ctx2dep16_gen1dep16_batch16_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 2 + prefill_nodes: 4 + decode_workers: 1 + decode_nodes: 2 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + decode: + tensor_parallel_size: 16 + 
moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c128_ctx1_gen7_dep8_batch128_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c128_ctx1_gen7_dep8_batch128_eplb0_mtp3.yaml new file mode 100644 index 000000000..230e3a281 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c128_ctx1_gen7_dep8_batch128_eplb0_mtp3.yaml @@ -0,0 +1,113 @@ +name: "c128_ctx1_gen7_dep8_batch128_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + decode_workers: 7 + decode_nodes: 7 + gpus_per_node: 8 + +backend: + type: trtllm + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + 
TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + trtllm_config: + prefill: + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 8 + max_num_tokens: 8192 + max_seq_len: 1064 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + enable_chunked_prefill: false + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 2088 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1,2,4,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128] + disable_overlap_scheduler: false + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "64"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c16_ctx1_gen9_tep8_batch128_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c16_ctx1_gen9_tep8_batch128_eplb0_mtp3.yaml
new file mode 100644
index 000000000..b66e9d91a
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c16_ctx1_gen9_tep8_batch128_eplb0_mtp3.yaml
@@ -0,0 +1,143 @@
+name: "c16_ctx1_gen9_tep8_batch128_eplb0_mtp3"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 9
+  decode_nodes: 9
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (MTP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (MTP c=16)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 128
+      max_num_tokens: 512
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "80"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false  # For some reason, the H200 cluster doesn't like nginx.
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c1_ctx1_gen11_tep8_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c1_ctx1_gen11_tep8_batch1_eplb0_mtp3.yaml
new file mode 100644
index 000000000..246c12a61
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c1_ctx1_gen11_tep8_batch1_eplb0_mtp3.yaml
@@ -0,0 +1,123 @@
+name: "c1_ctx1_gen11_tep8_batch1_eplb0_mtp3"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 11
+  decode_nodes: 11
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (MTP mode, aggressive ctx:gen 1:11 for c=1)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (MTP c=1, TEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 1
+      max_num_tokens: 4
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "96" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c256_ctx1_gen4_dep8_batch128_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c256_ctx1_gen4_dep8_batch128_eplb0_mtp3.yaml new file mode 100644 index 000000000..84c66f292 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c256_ctx1_gen4_dep8_batch128_eplb0_mtp3.yaml @@ -0,0 +1,113 @@ +name: "c256_ctx1_gen4_dep8_batch128_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + decode_workers: 4 + decode_nodes: 4 + gpus_per_node: 8 + +backend: + type: trtllm + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + trtllm_config: + prefill: + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 8 + max_num_tokens: 8192 + max_seq_len: 1064 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + 
moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + enable_chunked_prefill: false + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 2088 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1,2,4,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128] + disable_overlap_scheduler: false + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c32_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c32_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml new file mode 100644 index 000000000..898b6b248 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c32_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml @@ -0,0 +1,113 @@ +name: "c32_ctx1_gen11_tep8_batch128_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + decode_workers: 11 + decode_nodes: 11 + gpus_per_node: 8 + +backend: + type: trtllm + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + trtllm_config: + prefill: + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 8 + max_num_tokens: 8192 + max_seq_len: 1064 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + 
moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + enable_chunked_prefill: false + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 2088 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1,2,4,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128] + disable_overlap_scheduler: false + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "96" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c4_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c4_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml new file mode 100644 index 000000000..ff64103a1 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c4_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml @@ -0,0 +1,141 @@ +name: "c4_ctx1_gen11_tep8_batch128_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 11 + decode_nodes: 11 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode, aggressive ctx:gen 1:11 for c=4) + # ISL/OSL: 1k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 8 + max_num_tokens: 8192 + max_seq_len: 1064 + kv_cache_config: + enable_block_reuse: 
false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=4, TEP mode) + # ISL/OSL: 1k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + enable_chunked_prefill: false + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 2088 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + disable_overlap_scheduler: false + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "96"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c512_ctx1_gen2_dep8_batch256_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c512_ctx1_gen2_dep8_batch256_eplb0_mtp1.yaml
new file mode 100644
index 000000000..04d320697
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c512_ctx1_gen2_dep8_batch256_eplb0_mtp1.yaml
@@ -0,0 +1,159 @@
+name: "c512_ctx1_gen2_dep8_batch256_eplb0_mtp1"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 2
+  decode_nodes: 2
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (MTP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 1
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (MTP c=512)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      enable_chunked_prefill: false
+      max_batch_size: 256
+      max_num_tokens: 512
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.85
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+          - 136
+          - 144
+          - 152
+          - 160
+          - 168
+          - 176
+          - 184
+          - 192
+          - 200
+          - 208
+          - 216
+          - 224
+          - 232
+          - 240
+          - 248
+          - 256
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 1
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "24"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false  # For some reason, the H200 cluster doesn't like nginx.
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c64_ctx1_gen8_dep8_batch128_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c64_ctx1_gen8_dep8_batch128_eplb0_mtp3.yaml
new file mode 100644
index 000000000..af18c65d3
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c64_ctx1_gen8_dep8_batch128_eplb0_mtp3.yaml
@@ -0,0 +1,143 @@
+name: "c64_ctx1_gen8_dep8_batch128_eplb0_mtp3"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 8
+
+  decode_nodes: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (MTP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (MTP c=64)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      enable_chunked_prefill: false
+      max_batch_size: 128
+      max_num_tokens: 512
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.85
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "72"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false  # For some reason, the H200 cluster doesn't like nginx.
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c8_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c8_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml new file mode 100644 index 000000000..f0e0f9a58 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c8_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml @@ -0,0 +1,113 @@ +name: "c8_ctx1_gen11_tep8_batch128_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + decode_workers: 11 + decode_nodes: 11 + gpus_per_node: 8 + +backend: + type: trtllm + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + trtllm_config: + prefill: + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 8 + max_num_tokens: 8192 + max_seq_len: 1064 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + backend: pytorch + trust_remote_code: true + 
tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + enable_chunked_prefill: false + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 2088 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1,2,4,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128] + disable_overlap_scheduler: false + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "96" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c128_ctx1_gen9_dep8_batch512_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c128_ctx1_gen9_dep8_batch512_eplb0_mtp0.yaml new file mode 100644 index 000000000..eaa74f374 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c128_ctx1_gen9_dep8_batch512_eplb0_mtp0.yaml @@ -0,0 +1,188 @@ +name: "c128_ctx1_gen9_dep8_batch512_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + 
+  prefill_workers: 1
+
+  decode_workers: 9
+  decode_nodes: 9
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (DEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      # Matches E2E standalone ctx_config.yaml
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (DEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      # Matches E2E standalone gen_config.yaml (DEP c=128)
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 512
+      max_num_tokens: 512
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+          - 136
+          - 144
+          - 152
+          - 160
+          - 168
+          - 176
+          - 184
+          - 192
+          - 200
+          - 208
+          - 216
+          - 224
+          - 232
+          - 240
+          - 248
+          - 256
+          - 264
+          - 272
+          - 280
+          - 288
+          - 296
+          - 304
+          - 312
+          - 320
+          - 328
+          - 336
+          - 344
+          - 352
+          - 360
+          - 368
+          - 376
+          - 384
+          - 392
+          - 400
+          - 408
+          - 416
+          - 424
+          - 432
+          - 440
+          - 448
+          - 456
+          - 464
+          - 472
+          - 480
+          - 488
+          - 496
+          - 504
+          - 512
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "80"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx.
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c16_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c16_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml
new file mode 100644
index 000000000..03de93867
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c16_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml
@@ -0,0 +1,153 @@
+name: "c16_ctx1_gen9_tep8_batch256_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 9
+  decode_nodes: 9
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (TEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (TEP c=16)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 256
+      max_num_tokens: 256
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+          - 136
+          - 144
+          - 152
+          - 160
+          - 168
+          - 176
+          - 184
+          - 192
+          - 200
+          - 208
+          - 216
+          - 224
+          - 232
+          - 240
+          - 248
+          - 256
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "80"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx.
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c1_ctx1_gen9_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c1_ctx1_gen9_tep8_batch1_eplb0_mtp0.yaml
new file mode 100644
index 000000000..0f29aab2f
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c1_ctx1_gen9_tep8_batch1_eplb0_mtp0.yaml
@@ -0,0 +1,119 @@
+name: "c1_ctx1_gen9_tep8_batch1_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 9
+  decode_nodes: 9
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (TEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (TEP c=1)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 1
+      max_num_tokens: 1
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "80"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx.
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c256_ctx1_gen6_dep8_batch512_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c256_ctx1_gen6_dep8_batch512_eplb0_mtp0.yaml
new file mode 100644
index 000000000..4393dacf8
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c256_ctx1_gen6_dep8_batch512_eplb0_mtp0.yaml
@@ -0,0 +1,107 @@
+name: "c256_ctx1_gen6_dep8_batch512_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_workers: 6
+  decode_nodes: 6
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+  trtllm_config:
+    prefill:
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+    decode:
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 512
+      max_num_tokens: 512
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes: [1,2,4,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168,176,184,192,200,208,216,224,232,240,248,256,264,272,280,288,296,304,312,320,328,336,344,352,360,368,376,384,392,400,408,416,424,432,440,448,456,464,472,480,488,496,504,512]
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "56"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c32_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c32_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml
new file mode 100644
index 000000000..9b2d8fbf5
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c32_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml
@@ -0,0 +1,153 @@
+name: "c32_ctx1_gen9_tep8_batch256_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 9
+  decode_nodes: 9
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (TEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (TEP c=32)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 256
+      max_num_tokens: 256
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+          - 136
+          - 144
+          - 152
+          - 160
+          - 168
+          - 176
+          - 184
+          - 192
+          - 200
+          - 208
+          - 216
+          - 224
+          - 232
+          - 240
+          - 248
+          - 256
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "80"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx.
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c4_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c4_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml
new file mode 100644
index 000000000..ee3a951cf
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c4_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml
@@ -0,0 +1,153 @@
+name: "c4_ctx1_gen9_tep8_batch256_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 9
+  decode_nodes: 9
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (TEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (TEP c=4)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 256
+      max_num_tokens: 256
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+          - 136
+          - 144
+          - 152
+          - 160
+          - 168
+          - 176
+          - 184
+          - 192
+          - 200
+          - 208
+          - 216
+          - 224
+          - 232
+          - 240
+          - 248
+          - 256
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "80"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx.
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c512_ctx2_gen7_dep8_batch512_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c512_ctx2_gen7_dep8_batch512_eplb0_mtp0.yaml
new file mode 100644
index 000000000..6356363ac
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c512_ctx2_gen7_dep8_batch512_eplb0_mtp0.yaml
@@ -0,0 +1,188 @@
+name: "c512_ctx2_gen7_dep8_batch512_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 2
+  prefill_workers: 2
+
+  decode_workers: 7
+  decode_nodes: 7
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (DEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      # Matches E2E standalone ctx_config.yaml
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (DEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      # Matches E2E standalone gen_config.yaml (DEP c=512)
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 512
+      max_num_tokens: 512
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+          - 136
+          - 144
+          - 152
+          - 160
+          - 168
+          - 176
+          - 184
+          - 192
+          - 200
+          - 208
+          - 216
+          - 224
+          - 232
+          - 240
+          - 248
+          - 256
+          - 264
+          - 272
+          - 280
+          - 288
+          - 296
+          - 304
+          - 312
+          - 320
+          - 328
+          - 336
+          - 344
+          - 352
+          - 360
+          - 368
+          - 376
+          - 384
+          - 392
+          - 400
+          - 408
+          - 416
+          - 424
+          - 432
+          - 440
+          - 448
+          - 456
+          - 464
+          - 472
+          - 480
+          - 488
+          - 496
+          - 504
+          - 512
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "72"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx.
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c64_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c64_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml
new file mode 100644
index 000000000..ce67bee55
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c64_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml
@@ -0,0 +1,153 @@
+name: "c64_ctx1_gen9_tep8_batch256_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 9
+  decode_nodes: 9
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (TEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
max_num_tokens: 8192 + max_seq_len: 1064 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + decode: + # Decode Worker Config for Dynamo DSR1 (TEP c=16) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + enable_chunked_prefill: false + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "80" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c8_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c8_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml new file mode 100644 index 000000000..a5522bdad --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c8_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml @@ -0,0 +1,153 @@ +name: "c8_ctx1_gen9_tep8_batch256_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 9 + decode_nodes: 9 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (TEP mode) + # ISL/OSL: 1k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 8 + max_num_tokens: 8192 + max_seq_len: 1064 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + decode: + # 
Decode Worker Config for Dynamo DSR1 (TEP c=8) + # ISL/OSL: 1k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + enable_chunked_prefill: false + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "80" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c128_ctx2_gen1_dep8_batch32_eplb0_mtp2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c128_ctx2_gen1_dep8_batch32_eplb0_mtp2.yaml new file mode 100644 index 000000000..1ad52f9f3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c128_ctx2_gen1_dep8_batch32_eplb0_mtp2.yaml @@ -0,0 +1,123 @@ +name: "c128_ctx2_gen1_dep8_batch32_eplb0_mtp2" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 2 + prefill_workers: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=128) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp2.yaml new file mode 100644 index 000000000..23ad0751a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp2.yaml @@ -0,0 +1,123 @@ +name: "c16_ctx1_gen3_tep8_batch32_eplb0_mtp2" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=16) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp3.yaml new file mode 100644 index 000000000..4649032a7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp3.yaml @@ -0,0 +1,123 @@ +name: "c1_ctx1_gen7_tep8_batch1_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 7 + decode_nodes: 7 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + 
decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=1) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c256_ctx3_gen1_dep8_batch32_eplb0_mtp2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c256_ctx3_gen1_dep8_batch32_eplb0_mtp2.yaml new file mode 100644 index 000000000..92ed944df --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c256_ctx3_gen1_dep8_batch32_eplb0_mtp2.yaml @@ -0,0 +1,123 @@ +name: "c256_ctx3_gen1_dep8_batch32_eplb0_mtp2" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 3 + prefill_workers: 3 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=256) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c32_ctx3_gen5_tep8_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c32_ctx3_gen5_tep8_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..01616d163 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c32_ctx3_gen5_tep8_batch32_eplb0_mtp3.yaml @@ -0,0 +1,123 @@ +name: "c32_ctx3_gen5_tep8_batch32_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 3 + prefill_workers: 3 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=32) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..78cc69344 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp3.yaml @@ -0,0 +1,123 @@ +name: "c4_ctx1_gen7_tep8_batch32_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 7 + decode_nodes: 7 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=4) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c512_ctx3_gen1_dep8_batch64_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c512_ctx3_gen1_dep8_batch64_eplb0_mtp1.yaml new file mode 100644 index 000000000..607011f5c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c512_ctx3_gen1_dep8_batch64_eplb0_mtp1.yaml @@ -0,0 +1,123 @@ +name: "c512_ctx3_gen1_dep8_batch64_eplb0_mtp1" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 3 + prefill_workers: 3 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=512) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 64 + max_num_tokens: 256 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32, 64] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c64_ctx1_gen1_dep8_batch32_eplb0_mtp2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c64_ctx1_gen1_dep8_batch32_eplb0_mtp2.yaml new file mode 100644 index 000000000..02db00cb0 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c64_ctx1_gen1_dep8_batch32_eplb0_mtp2.yaml @@ -0,0 +1,123 @@ +name: "c64_ctx1_gen1_dep8_batch32_eplb0_mtp2" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=64) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c8_ctx1_gen6_tep8_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c8_ctx1_gen6_tep8_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..89cefb58e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c8_ctx1_gen6_tep8_batch32_eplb0_mtp3.yaml @@ -0,0 +1,123 @@ +name: "c8_ctx1_gen6_tep8_batch32_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 6 + decode_nodes: 6 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=8) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c128_ctx1_gen1_dep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c128_ctx1_gen1_dep8_batch256_eplb0_mtp0.yaml new file mode 100644 index 000000000..6f9e2c92e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c128_ctx1_gen1_dep8_batch256_eplb0_mtp0.yaml @@ -0,0 +1,120 @@ +name: "c128_ctx1_gen1_dep8_batch256_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (DEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + # Matches E2E standalone ctx_config.yaml + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + 
stream_interval: 100 + num_postprocess_workers: 4 + + decode: + # Decode Worker Config for Dynamo DSR1 (DEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + # Matches E2E standalone gen_config.yaml (DEP c=128) + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..a7cc5137e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "c16_ctx1_gen3_tep8_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (TEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + decode: + # 
Decode Worker Config for Dynamo DSR1 (TEP c=16) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..82064a374 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "c1_ctx1_gen7_tep8_batch1_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 7 + decode_nodes: 7 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (TEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + decode: + # Decode 
Worker Config for Dynamo DSR1 (TEP c=1) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c256_ctx5_gen3_dep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c256_ctx5_gen3_dep8_batch256_eplb0_mtp0.yaml new file mode 100644 index 000000000..da13164cd --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c256_ctx5_gen3_dep8_batch256_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "c256_ctx5_gen3_dep8_batch256_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 5 + prefill_workers: 5 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (DEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
decode: + # Decode Worker Config for Dynamo DSR1 (DEP c=256) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c32_ctx2_gen5_tep8_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c32_ctx2_gen5_tep8_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..38d63593a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c32_ctx2_gen5_tep8_batch128_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "c32_ctx2_gen5_tep8_batch128_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 2 + prefill_workers: 2 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (TEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + decode: + 
# Decode Worker Config for Dynamo DSR1 (TEP c=32) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..19ba51ba6 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "c4_ctx1_gen7_tep8_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 7 + decode_nodes: 7 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (TEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + decode: + # 
Decode Worker Config for Dynamo DSR1 (TEP c=4) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c512_ctx3_gen1_dep8_batch512_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c512_ctx3_gen1_dep8_batch512_eplb0_mtp0.yaml new file mode 100644 index 000000000..3b35f1299 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c512_ctx3_gen1_dep8_batch512_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "c512_ctx3_gen1_dep8_batch512_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 3 + prefill_workers: 3 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (DEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
decode: + # Decode Worker Config for Dynamo DSR1 (DEP c=512) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 512 + max_num_tokens: 512 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c64_ctx2_gen3_dep8_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c64_ctx2_gen3_dep8_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..531f573f3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c64_ctx2_gen3_dep8_batch128_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "c64_ctx2_gen3_dep8_batch128_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 2 + prefill_workers: 2 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (DEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + decode: + 
# Decode Worker Config for Dynamo DSR1 (DEP c=64) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c8_ctx1_gen6_tep8_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c8_ctx1_gen6_tep8_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..c8a885d95 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c8_ctx1_gen6_tep8_batch16_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "c8_ctx1_gen6_tep8_batch16_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 6 + decode_nodes: 6 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (TEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + decode: + # 
Decode Worker Config for Dynamo DSR1 (TEP c=8) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep8-tep8.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep8-tep8.yaml new file mode 100644 index 000000000..b6eca4631 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep8-tep8.yaml @@ -0,0 +1,154 @@ +name: "svf-vllm-disagg-gb200-mid-curve" + +# Mirrored from NVIDIA/srt-slurm aflowers/vllm-gb200-v0.20.0 branch: +# recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-mid-curve.yaml +# +# Topology: 1 prefill (DEP=8) + 1 decode (DEP=8). 
5 nodes total with a +# dedicated NATS/etcd infra node. Mid-curve point at concurrency 256. +# +# Local deltas vs upstream: +# * model.path alias renamed deepseekv4-fp4 -> deepseek-v4-pro to match +# SRT_SLURM_MODEL_PREFIX in runners/launch_gb200-nv.sh. +# * model.container set to vllm/vllm-openai:v0.20.0-ubuntu2404 to +# match nvidia-master.yaml image (which the launch script registers as +# the alias key in srtslurm.yaml). Upstream variants ship either the +# non-dynamo floating tag or a sha256 pin. +# * slurm.time_limit + health_check set to 8h / 1440 attempts to +# absorb cold-cache /mnt/numa1 model loads. +model: + path: "deepseek-v4-pro" + container: "vllm/vllm-openai:v0.20.0-ubuntu2404" + precision: "fp4" + +dynamo: + install: true + wheel: "1.2.0.dev20260426" + +setup_script: vllm-container-deps.sh + +slurm: + time_limit: "8:00:00" + +health_check: + max_attempts: 1440 + interval_seconds: 10 +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 2 + decode_nodes: 2 + prefill_workers: 1 + decode_workers: 1 + gpus_per_prefill: 8 + gpus_per_decode: 8 + +infra: + etcd_nats_dedicated_node: true + +frontend: + type: dynamo + enable_multiple_frontends: false +backend: + type: vllm + connector: null + prefill_environment: + TILELANG_CLEANUP_TEMP_FILES: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + VLLM_SERVER_DEV_MODE: "1" + VLLM_SPARSE_INDEXER_MAX_LOGITS_MB: "1024" + VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE: "2048" + # VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1" + # VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random" + UCX_MEMTYPE_CACHE: "n" + UCX_MEMTYPE_REG_WHOLE: "n" + UCX_TLS: "cuda_copy,cuda_ipc,tcp" + UCX_CUDA_IPC_ENABLE_MNNVL: "y" + NCCL_P2P_LEVEL: NVL + decode_environment: + TILELANG_CLEANUP_TEMP_FILES: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + VLLM_SERVER_DEV_MODE: "1" + # VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1" + # 
VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random" + UCX_MEMTYPE_CACHE: "n" + UCX_MEMTYPE_REG_WHOLE: "n" + UCX_TLS: "cuda_copy,cuda_ipc,tcp" + UCX_CUDA_IPC_ENABLE_MNNVL: "y" + NCCL_P2P_LEVEL: NVL + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + enforce-eager: true + max-model-len: 9280 + max-num-seqs: 16 + max-num-batched-tokens: 32768 + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-flashinfer-autotune: true + no-async-scheduling: true + block-size: 256 + gpu-memory-utilization: 0.8 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true + numa-bind: true + offload-group-size: 3 + offload-num-in-group: 1 + offload-prefetch-step: 2 + # offload-params: "w13_weight w2_weight w13_weight_scale w2_weight_scale wq_b wo_a wo_b shared_experts" + tokenizer-mode: deepseek_v4 + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 9280 + max-num-seqs: 128 + max-cudagraph-capture-size: 128 + max-num-batched-tokens: 128 + trust-remote-code: true + no-enable-prefix-caching: true + block-size: 256 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true + tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
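+# Worked example (per the CONFIGS.md env contract): PREFILL_GPUS/DECODE_GPUS
+# below are per-worker counts and TOTAL_GPUS is the sum across workers, all
+# used only as result-filename components:
+#   TOTAL_GPUS = prefill_workers*gpus_per_prefill + decode_workers*gpus_per_decode
+#              = 1*8 + 1*8 = 16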
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-V4-Pro" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + DSV4: "true" + +identity: + container: + image: "vllm/vllm-openai:v0.20.0-ubuntu2404" + frameworks: + dynamo: "1.2.0.dev20260426" + vllm: "0.20.0" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-1p1d-dep8-tep8.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-1p1d-dep8-tep8.yaml new file mode 100644 index 000000000..2f0fa98e6 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-1p1d-dep8-tep8.yaml @@ -0,0 +1,150 @@ +name: "dsv4-vllm-disagg-gb200-2p1d-dep8-dep8-offload" + +# Mirrored from NVIDIA/srt-slurm aflowers/gb200-dsv4-recipes branch (PR #77): +# recipes/vllm/deepseek-v4-pro-sa/8k1k/disagg-gb200-2p1d-dep8-dep8-c4096-offload.yaml +# +# Topology: 2 prefill (DEP=8 each) + 1 decode (DEP=8). 6 nodes. +# c4096-tuned variant (decode max-num-seqs=512). +# +# Local deltas vs upstream: +# * model.path alias renamed deepseekv4-fp4 -> deepseek-v4-pro to match +# SRT_SLURM_MODEL_PREFIX in runners/launch_gb200-nv.sh. +# * model.container set to vllm/vllm-openai:v0.20.0-ubuntu2404 to +# match nvidia-master.yaml image (which the launch script registers as +# the alias key in srtslurm.yaml). Upstream variants ship either the +# non-dynamo floating tag or a sha256 pin. +# * slurm.time_limit + health_check set to 8h / 1440 attempts to +# absorb cold-cache /mnt/numa1 model loads. 
+model: + path: "deepseek-v4-pro" + container: "vllm/vllm-openai:v0.20.0-ubuntu2404" + precision: "fp4" + +dynamo: + install: true + wheel: "1.2.0.dev20260426" + +setup_script: vllm-container-deps.sh + +slurm: + time_limit: "8:00:00" + +health_check: + max_attempts: 1440 + interval_seconds: 10 +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 4 + decode_nodes: 2 + prefill_workers: 2 + decode_workers: 1 + gpus_per_prefill: 8 + gpus_per_decode: 8 +frontend: + type: dynamo + enable_multiple_frontends: false +backend: + type: vllm + connector: null + prefill_environment: + TILELANG_CLEANUP_TEMP_FILES: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + VLLM_SERVER_DEV_MODE: "1" + VLLM_SPARSE_INDEXER_MAX_LOGITS_MB: "1024" + VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE: "2048" + # VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1" + # VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random" + UCX_MEMTYPE_CACHE: "n" + UCX_MEMTYPE_REG_WHOLE: "n" + UCX_TLS: "cuda_copy,cuda_ipc,tcp" + UCX_CUDA_IPC_ENABLE_MNNVL: "y" + NCCL_P2P_LEVEL: NVL + decode_environment: + TILELANG_CLEANUP_TEMP_FILES: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + VLLM_SERVER_DEV_MODE: "1" + # VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1" + # VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random" + UCX_MEMTYPE_CACHE: "n" + UCX_MEMTYPE_REG_WHOLE: "n" + UCX_TLS: "cuda_copy,cuda_ipc,tcp" + UCX_CUDA_IPC_ENABLE_MNNVL: "y" + NCCL_P2P_LEVEL: NVL + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + enforce-eager: true + max-model-len: 16384 + max-num-seqs: 16 + max-num-batched-tokens: 32768 + trust-remote-code: true + no-enable-prefix-caching: 
true + no-enable-flashinfer-autotune: true + no-async-scheduling: true + block-size: 256 + gpu-memory-utilization: 0.8 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true + numa-bind: true + offload-group-size: 3 + offload-num-in-group: 1 + offload-prefetch-step: 2 + # offload-params: "w13_weight w2_weight w13_weight_scale w2_weight_scale wq_b wo_a wo_b shared_experts" + tokenizer-mode: deepseek_v4 + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 16384 + max-num-seqs: 512 + max-cudagraph-capture-size: 512 + max-num-batched-tokens: 512 + trust-remote-code: true + no-enable-prefix-caching: true + block-size: 256 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true + tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-V4-Pro" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + DSV4: "true" + +identity: + container: + image: "vllm/vllm-openai:v0.20.0-ubuntu2404" + frameworks: + dynamo: "1.2.0.dev20260426" + vllm: "0.20.0" diff --git a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-2p1d-dep8-dep8-c4096-offload.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-2p1d-dep8-dep8-c4096-offload.yaml similarity index 90% rename from benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-2p1d-dep8-dep8-c4096-offload.yaml rename to benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-2p1d-dep8-dep8-c4096-offload.yaml index 9848edb01..2f0fa98e6 100644 --- a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-2p1d-dep8-dep8-c4096-offload.yaml +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-2p1d-dep8-dep8-c4096-offload.yaml @@ -127,14 +127,20 @@ backend: no-disable-hybrid-kv-cache-manager: true enable-sleep-mode: true tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
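+# MODEL_NAME below is the rare override described in CONFIGS.md: the workers
+# register under served-model-name "deepseek-ai/DeepSeek-V4-Pro", which
+# diverges from the master-yaml `model:` alias, so (assuming srt_bench.sh
+# targets the served name) the bench client is told it explicitly.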
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + benchmark: - type: "sa-bench" - isl: 8192 - osl: 1024 - concurrencies: "4096" - req_rate: "inf" - use_chat_template: true - custom_tokenizer: "sa_bench_tokenizers.vllm_deepseek_v4.VLLMDeepseekV4Tokenizer" + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-V4-Pro" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + DSV4: "true" identity: container: diff --git a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-3p1d-dep8-dep16-c4096-offload.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16-c4096-offload.yaml similarity index 90% rename from benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-3p1d-dep8-dep16-c4096-offload.yaml rename to benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16-c4096-offload.yaml index 3f3803d3b..85ff907e3 100644 --- a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-3p1d-dep8-dep16-c4096-offload.yaml +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16-c4096-offload.yaml @@ -127,14 +127,20 @@ backend: no-disable-hybrid-kv-cache-manager: true enable-sleep-mode: true tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + benchmark: - type: "sa-bench" - isl: 8192 - osl: 1024 - concurrencies: "4096" - req_rate: "inf" - use_chat_template: true - custom_tokenizer: "sa_bench_tokenizers.vllm_deepseek_v4.VLLMDeepseekV4Tokenizer" + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-V4-Pro" + PREFILL_GPUS: "8" + DECODE_GPUS: "16" + TOTAL_GPUS: "40" + DSV4: "true" identity: container: diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16.yaml new file mode 100644 index 000000000..85ff907e3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16.yaml @@ -0,0 +1,150 @@ +name: "dsv4-vllm-disagg-gb200-3p1d-dep8-dep16-offload" + +# Mirrored from NVIDIA/srt-slurm aflowers/gb200-dsv4-recipes branch (PR #77): +# recipes/vllm/deepseek-v4-pro-sa/8k1k/disagg-gb200-3p1d-dep8-dep16-c4096-offload.yaml +# +# Topology: 3 prefill (DEP=8) + 1 wide decode (DEP=16). 10 nodes. +# c4096-tuned variant. +# +# Local deltas vs upstream: +# * model.path alias renamed deepseekv4-fp4 -> deepseek-v4-pro to match +# SRT_SLURM_MODEL_PREFIX in runners/launch_gb200-nv.sh. +# * model.container set to vllm/vllm-openai:v0.20.0-ubuntu2404 to +# match nvidia-master.yaml image (which the launch script registers as +# the alias key in srtslurm.yaml). Upstream variants ship either the +# non-dynamo floating tag or a sha256 pin. +# * slurm.time_limit + health_check set to 8h / 1440 attempts to +# absorb cold-cache /mnt/numa1 model loads. 
+model: + path: "deepseek-v4-pro" + container: "vllm/vllm-openai:v0.20.0-ubuntu2404" + precision: "fp4" + +dynamo: + install: true + wheel: "1.2.0.dev20260426" + +setup_script: vllm-container-deps.sh + +slurm: + time_limit: "8:00:00" + +health_check: + max_attempts: 1440 + interval_seconds: 10 +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 6 + decode_nodes: 4 + prefill_workers: 3 + decode_workers: 1 + gpus_per_prefill: 8 + gpus_per_decode: 16 +frontend: + type: dynamo + enable_multiple_frontends: false +backend: + type: vllm + connector: null + prefill_environment: + TILELANG_CLEANUP_TEMP_FILES: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + VLLM_SERVER_DEV_MODE: "1" + VLLM_SPARSE_INDEXER_MAX_LOGITS_MB: "1024" + VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE: "2048" + # VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1" + # VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random" + UCX_MEMTYPE_CACHE: "n" + UCX_MEMTYPE_REG_WHOLE: "n" + UCX_TLS: "cuda_copy,cuda_ipc,tcp" + UCX_CUDA_IPC_ENABLE_MNNVL: "y" + NCCL_P2P_LEVEL: NVL + decode_environment: + TILELANG_CLEANUP_TEMP_FILES: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + VLLM_SERVER_DEV_MODE: "1" + # VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1" + # VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random" + UCX_MEMTYPE_CACHE: "n" + UCX_MEMTYPE_REG_WHOLE: "n" + UCX_TLS: "cuda_copy,cuda_ipc,tcp" + UCX_CUDA_IPC_ENABLE_MNNVL: "y" + NCCL_P2P_LEVEL: NVL + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + enforce-eager: true + max-model-len: 16384 + max-num-seqs: 16 + max-num-batched-tokens: 32768 + trust-remote-code: true + no-enable-prefix-caching: 
true + no-enable-flashinfer-autotune: true + no-async-scheduling: true + block-size: 256 + gpu-memory-utilization: 0.8 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true + numa-bind: true + offload-group-size: 3 + offload-num-in-group: 1 + offload-prefetch-step: 2 + # offload-params: "w13_weight w2_weight w13_weight_scale w2_weight_scale wq_b wo_a wo_b shared_experts" + tokenizer-mode: deepseek_v4 + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 16 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 16384 + max-num-seqs: 256 + max-cudagraph-capture-size: 256 + max-num-batched-tokens: 256 + trust-remote-code: true + no-enable-prefix-caching: true + block-size: 256 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true + tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-V4-Pro" + PREFILL_GPUS: "8" + DECODE_GPUS: "16" + TOTAL_GPUS: "40" + DSV4: "true" + +identity: + container: + image: "vllm/vllm-openai:v0.20.0-ubuntu2404" + frameworks: + dynamo: "1.2.0.dev20260426" + vllm: "0.20.0" diff --git a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-latency.yaml similarity index 91% rename from benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-latency.yaml rename to benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-latency.yaml index 137e3017a..b6e334b02 100644 --- a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-latency.yaml +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-latency.yaml @@ -131,14 +131,19 @@ backend: no-disable-hybrid-kv-cache-manager: true enable-sleep-mode: true tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + benchmark: - type: "sa-bench" - isl: 8192 - osl: 1024 - concurrencies: "1" - req_rate: "inf" - use_chat_template: true - custom_tokenizer: "sa_bench_tokenizers.vllm_deepseek_v4.VLLMDeepseekV4Tokenizer" + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + DSV4: "true" identity: container: diff --git a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-middle-curve.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-middle-curve.yaml similarity index 91% rename from benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-middle-curve.yaml rename to benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-middle-curve.yaml index 20672bfdf..3d924449d 100644 --- a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-middle-curve.yaml +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-middle-curve.yaml @@ -133,14 +133,19 @@ backend: no-disable-hybrid-kv-cache-manager: true enable-sleep-mode: true tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + benchmark: - type: "sa-bench" - isl: 8192 - osl: 1024 - concurrencies: "256x512" - req_rate: "inf" - use_chat_template: true - custom_tokenizer: "sa_bench_tokenizers.vllm_deepseek_v4.VLLMDeepseekV4Tokenizer" + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + DSV4: "true" identity: container: diff --git a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-max-tpt-megamoe.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt-megamoe.yaml similarity index 92% rename from benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-max-tpt-megamoe.yaml rename to benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt-megamoe.yaml index fe3840109..e749199ed 100644 --- a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-max-tpt-megamoe.yaml +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt-megamoe.yaml @@ -134,14 +134,19 @@ backend: no-disable-hybrid-kv-cache-manager: true enable-sleep-mode: true tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + benchmark: - type: "sa-bench" - isl: 8192 - osl: 1024 - concurrencies: "4096" - req_rate: "inf" - use_chat_template: true - custom_tokenizer: "sa_bench_tokenizers.vllm_deepseek_v4.VLLMDeepseekV4Tokenizer" + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + DSV4: "true" identity: model: diff --git a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-max-tpt.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt.yaml similarity index 91% rename from benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-max-tpt.yaml rename to benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt.yaml index 754d61662..d6b2c11f2 100644 --- a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-max-tpt.yaml +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt.yaml @@ -131,14 +131,19 @@ backend: no-disable-hybrid-kv-cache-manager: true enable-sleep-mode: true tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + benchmark: - type: "sa-bench" - isl: 8192 - osl: 1024 - concurrencies: "4096" - req_rate: "inf" - use_chat_template: true - custom_tokenizer: "sa_bench_tokenizers.vllm_deepseek_v4.VLLMDeepseekV4Tokenizer" + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + DSV4: "true" identity: container: diff --git a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-mid-curve.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-mid-curve.yaml similarity index 91% rename from benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-mid-curve.yaml rename to benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-mid-curve.yaml index bf8e6c452..0e40d5d40 100644 --- a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-mid-curve.yaml +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-mid-curve.yaml @@ -131,14 +131,19 @@ backend: no-disable-hybrid-kv-cache-manager: true enable-sleep-mode: true tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + benchmark: - type: "sa-bench" - isl: 8192 - osl: 1024 - concurrencies: "256" - req_rate: "inf" - use_chat_template: true - custom_tokenizer: "sa_bench_tokenizers.vllm_deepseek_v4.VLLMDeepseekV4Tokenizer" + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + DSV4: "true" identity: container: diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..49a38528d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml @@ -0,0 +1,131 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep16_batch32_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32 +# STP (no speculative decoding) +# concurrency: 666 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + 
tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
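+# TOTAL_GPUS below follows the worker-sum rule from CONFIGS.md:
+# 1 prefill worker x 4 GPUs + 1 decode worker x 16 GPUs = 20.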
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "20" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..c83b4c67b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml @@ -0,0 +1,135 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep32_batch64_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=64 +# STP (no speculative decoding) +# concurrency: 2253 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + 
enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
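+# Workload knobs (MODEL, ISL, OSL, CONC_LIST, DISAGG, RANDOM_RANGE_RATIO)
+# are exported by benchmark-multinode-tmpl.yml and reach this container via
+# srtctl -> srun (default --export=ALL) -> pyxis, so benchmark.env only
+# carries the per-recipe filename components.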
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + enable_multiple_frontends: true + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml new file mode 100644 index 000000000..e5a833580 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml @@ -0,0 +1,223 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=768 +# STP (no speculative decoding) +# Covers all dep8 concurrencies: 4301, 6452 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + 
moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 768 + max_num_tokens: 768 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + - 520 + - 528 + - 536 + - 544 + - 552 + - 560 + - 568 + - 576 + - 584 + - 592 + - 600 + - 608 + - 616 + - 624 + - 632 + - 640 + - 648 + - 656 + - 664 + - 672 + - 680 + - 688 + - 696 + - 704 + - 712 + - 720 + - 728 + - 736 + - 744 + - 752 + - 760 + - 768 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench 
container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml new file mode 100644 index 000000000..a56150450 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml @@ -0,0 +1,144 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=128 +# STP (no speculative decoding) +# Covers all gen4tep8 concurrencies: 4, 192, 360, 668 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + 
TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + allreduce_strategy: MNNVL + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml new file mode 100644 index 000000000..ffb109b8d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml @@ -0,0 +1,128 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, max_batch=8 +# STP (no speculative decoding) +# Covers all gen5tep4 concurrencies: 5, 15, 30, 55 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + 
pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 8 + max_num_tokens: 8 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml new file mode 100644 index 000000000..f75876142 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml @@ -0,0 +1,159 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep16_batch256_eplb0_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=256 +# STP (no speculative decoding) +# concurrency: 4301 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 
+ enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..7fdf9daea --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml @@ -0,0 +1,143 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep32_batch128_eplb0_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=128 +# STP (no speculative decoding) +# concurrency: 4301 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 
1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml new file mode 100644 index 000000000..bbc7627ee --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml @@ -0,0 +1,132 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP4/EP4, max_batch=32 +# Single concurrency point: 156 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 1 worker x TP4 = 4 GPUs = 1 node + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + # Decode: 4 workers x TP4 = 16 GPUs = 4 nodes + decode_workers: 4 + decode_nodes: 4 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + 
tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "20" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml new file mode 100644 index 000000000..5a0b04c91 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml @@ -0,0 +1,129 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=1 +# Single concurrency point: 4 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 1 worker x TP4 = 4 GPUs = 1 node + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + # Decode: 4 workers x TP8 = 32 GPUs = 8 nodes + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + 
prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + allreduce_strategy: MNNVL + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml new file mode 100644 index 000000000..90d294ff5 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml @@ -0,0 +1,132 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, max_batch=16 +# Covers all concurrencies: 5, 15, 30, 60, 105 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 1 worker x TP4 = 4 GPUs = 1 node + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + # Decode: 5 workers x TP4 = 20 GPUs = 5 nodes + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + 
tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + # max_batch_size=16 covers all concs: 5, 15, 30, 60, 105 + # cuda_graph pre-compiles graphs for each batch size up to the max + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..8cc508d5e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml @@ -0,0 +1,130 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx2dep4_gen1dep16_batch16_eplb0_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=16 +# concurrency: 333 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 2 workers x TP4 = 8 GPUs = 2 nodes + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + # Decode: 1 worker x TP16 = 16 GPUs = 4 nodes + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 
+ moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..528b0b4f9 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml @@ -0,0 +1,132 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx3dep4_gen1dep16_batch32_eplb0_mtp0" + +# ctx: 3 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32 +# concurrency: 615 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 3 workers x TP4 = 12 GPUs = 3 nodes + prefill_nodes: 3 + prefill_workers: 3 + gpus_per_prefill: 4 + + # Decode: 1 worker x TP16 = 16 GPUs = 4 nodes + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 
4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml new file mode 100644 index 000000000..d0dbf80f0 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml @@ -0,0 +1,161 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0" + +# ctx: 5 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=256 +# Single concurrency point: 2151 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 5 workers x TP4 = 20 GPUs = 5 nodes + prefill_nodes: 5 + prefill_workers: 5 + gpus_per_prefill: 4 + + # Decode: 1 worker x TP8 = 8 GPUs = 2 nodes + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + 
trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + # max_batch_size=256, cuda_graph pre-compiles graphs for all batch sizes up to 256 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..6eb391bba --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml @@ -0,0 +1,144 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx7dep4_gen1dep16_batch128_eplb0_mtp0" + +# ctx: 7 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=128 +# concurrency: 2253 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 7 workers x TP4 = 28 GPUs = 7 nodes + prefill_nodes: 7 + prefill_workers: 7 + gpus_per_prefill: 4 + + # Decode: 1 worker x TP16 = 16 GPUs = 4 nodes + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + 
tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "44" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep4-dep16.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep4-dep16.yaml new file mode 100644 index 000000000..c5230d9e5 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep4-dep16.yaml @@ -0,0 +1,107 @@ +name: "kimi-vllm-disagg-gb200-1p1d-dep4-dep16" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 1 + decode_nodes: 4 + prefill_workers: 1 + decode_workers: 1 + gpus_per_prefill: 4 + gpus_per_decode: 16 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + 
max-model-len: 3072 + max-num-seqs: 4096 + enforce-eager: true + compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 16 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 3072 + max-num-seqs: 4096 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + all2all-backend: "flashinfer_nvlink_one_sided" + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 512 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "20" diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p4d-dep4-tep4.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p4d-dep4-tep4.yaml new file mode 100644 index 000000000..0992a5091 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p4d-dep4-tep4.yaml @@ -0,0 +1,104 @@ +name: "kimi-vllm-disagg-gb200-1p4d-dep4-tep4" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 1 + decode_nodes: 4 + prefill_workers: 1 + decode_workers: 4 + gpus_per_prefill: 4 + gpus_per_decode: 4 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 3072 + max-num-seqs: 1024 + enforce-eager: true + compilation-config: 
'{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 4 + pipeline-parallel-size: 1 + enable-expert-parallel: true + max-model-len: 3072 + max-num-seqs: 1024 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 1024 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "20" diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-1p4d-dep4-tep4.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-1p4d-dep4-tep4.yaml new file mode 100644 index 000000000..5670a9d54 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-1p4d-dep4-tep4.yaml @@ -0,0 +1,104 @@ +name: "kimi-vllm-disagg-gb200-1p4d-dep4-tep4" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 1 + decode_nodes: 4 + prefill_workers: 1 + decode_workers: 4 + gpus_per_prefill: 4 + gpus_per_decode: 4 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 64 + enforce-eager: true + compilation-config: 
'{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 4 + pipeline-parallel-size: 1 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 16 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 16 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "20" diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep4-dep16.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep4-dep16.yaml new file mode 100644 index 000000000..cecacdfd7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep4-dep16.yaml @@ -0,0 +1,107 @@ +name: "kimi-vllm-disagg-gb200-3p1d-dep4-dep16" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 3 + decode_nodes: 4 + prefill_workers: 3 + decode_workers: 1 + gpus_per_prefill: 4 + gpus_per_decode: 16 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 64 + enforce-eager: true + compilation-config: 
'{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 16 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 256 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + all2all-backend: "flashinfer_nvlink_one_sided" + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 256 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "28" diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-5p1d-dep4-dep8.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-5p1d-dep4-dep8.yaml new file mode 100644 index 000000000..259db9436 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-5p1d-dep4-dep8.yaml @@ -0,0 +1,107 @@ +name: "kimi-vllm-disagg-gb200-5p1d-dep4-dep8" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 5 + decode_nodes: 2 + prefill_workers: 5 + decode_workers: 1 + gpus_per_prefill: 4 + gpus_per_decode: 8 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 64 + enforce-eager: true + compilation-config: 
'{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 512 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + all2all-backend: "flashinfer_nvlink_one_sided" + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 512 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-6p1d-dep4-dep16.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-6p1d-dep4-dep16.yaml new file mode 100644 index 000000000..0a26d118d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-6p1d-dep4-dep16.yaml @@ -0,0 +1,107 @@ +name: "kimi-vllm-disagg-gb200-6p1d-dep4-dep16" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 6 + decode_nodes: 4 + prefill_workers: 6 + decode_workers: 1 + gpus_per_prefill: 4 + gpus_per_decode: 16 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 64 + enforce-eager: true + compilation-config: 
'{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 16 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 512 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + all2all-backend: "flashinfer_nvlink_one_sided" + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 512 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "40" diff --git a/benchmarks/multi_node/srt_bench.sh b/benchmarks/multi_node/srt_bench.sh new file mode 100755 index 000000000..aeb1ef502 --- /dev/null +++ b/benchmarks/multi_node/srt_bench.sh @@ -0,0 +1,127 @@ +#!/usr/bin/env bash +# Multi-node bench-serving wrapper invoked by srt-slurm via +# `benchmark.type: custom`. srt-slurm owns server bring-up; this script runs +# inside the same job's benchmark container against the already-ready +# frontend on the head node, then writes one results JSON per concurrency to +# /logs/sa-bench_isl_${ISL}_osl_${OSL}/ — the same path the launcher's existing +# result-harvesters glob. +# +# This is a thin loop on top of run_benchmark_serving() in benchmark_lib.sh +# (the same shim every single-node bench script uses), so any future change +# to bench-serving CLI conventions, profiling, server-health monitoring, etc. +# applies here automatically. +# +# Reads from env. 
Most of these are *already* exported by +# .github/workflows/benchmark-multinode-tmpl.yml at the workflow step level +# and propagate down through the launcher → srtctl → srun (default +# --export=ALL) → pyxis → bench container, so recipes do not need to +# re-declare them in `benchmark.env`: +# +# $MODEL served-model-name; matches workflow `inputs.model` +# $ISL $OSL sequence lengths +# $CONC_LIST space-separated concurrency list +# $DISAGG "true" / "false" — disagg vs aggregated +# $RANDOM_RANGE_RATIO 0.8 (workflow default) +# +# Per-recipe knobs that *do* live in `benchmark.env` (no workflow equivalent): +# PREFILL_GPUS per-prefill-worker GPU count (filename component) +# DECODE_GPUS per-decode-worker GPU count (filename component) +# TOTAL_GPUS sum across all workers (filename component) +# +# Optional per-recipe overrides (defaults shown): +# MODEL_NAME=$MODEL override when server's served-model-name differs +# from the master-yaml `model:` field +# PORT=8000 frontend port reachable at localhost +# BACKEND=openai generic OpenAI-API; works against the dynamo frontend +# ENDPOINT= empty -> bench_serving.py default (/v1/completions) +# NUM_PROMPTS_MULT=10 prompts per conc = NUM_PROMPTS_MULT * conc +# USE_CHAT_TEMPLATE=true +# DSV4=false sets the --dsv4 flag (auto-enables chat template) +# TRUST_REMOTE_CODE=true +# +# The InferenceX repo is bind-mounted at /infmax-workspace via each recipe's +# `container_mounts` block. Model files are auto-mounted at /model by srtctl +# (RuntimeContext.create unconditionally adds the mount when model.path is a +# local path), so we point --tokenizer at /model to load the tokenizer from +# the same files the engine is serving — no HF Hub dependency. 
+set -euo pipefail + +INFMAX_WS="${INFMAX_CONTAINER_WORKSPACE:-/infmax-workspace}" +# shellcheck disable=SC1091 +source "$INFMAX_WS/benchmarks/benchmark_lib.sh" + +check_env_vars MODEL ISL OSL CONC_LIST DISAGG \ + PREFILL_GPUS DECODE_GPUS TOTAL_GPUS + +MODEL_NAME="${MODEL_NAME:-$MODEL}" +PORT="${PORT:-8000}" +# `openai` matches every dynamo frontend (frontend exposes a generic OpenAI- +# compatible API regardless of the underlying engine). Recipes that need +# /v1/chat/completions can override ENDPOINT. +BACKEND="${BACKEND:-openai}" +ENDPOINT="${ENDPOINT:-}" +RANDOM_RANGE_RATIO="${RANDOM_RANGE_RATIO:-0.8}" +NUM_PROMPTS_MULT="${NUM_PROMPTS_MULT:-10}" +USE_CHAT_TEMPLATE="${USE_CHAT_TEMPLATE:-true}" +DSV4="${DSV4:-false}" +TRUST_REMOTE_CODE="${TRUST_REMOTE_CODE:-true}" + +RESULT_DIR="/logs/sa-bench_isl_${ISL}_osl_${OSL}" +mkdir -p "$RESULT_DIR" + +# srt-slurm worker containers don't always ship bench_serving.py's runtime +# deps (datasets in particular). Install missing ones into a system-site- +# packages venv so we don't perturb the framework's own packages. +ensure_bench_serving_deps() { + local deps=(aiohttp numpy pandas datasets Pillow tqdm transformers huggingface_hub) + if python3 -c "import aiohttp, numpy, pandas, datasets, PIL, tqdm, transformers, huggingface_hub" 2>/dev/null; then + return + fi + local venv="/tmp/srt-bench-venv" + [[ -d "$venv" ]] || python3 -m venv --system-site-packages "$venv" + # shellcheck disable=SC1091 + source "$venv/bin/activate" + pip install --quiet "${deps[@]}" +} +ensure_bench_serving_deps + +curl -fsS "http://localhost:${PORT}/v1/models" >/dev/null || { + echo "ERROR: frontend at http://localhost:${PORT} did not respond on /v1/models" >&2 + exit 66 +} +ulimit -n 65536 2>/dev/null || true + +# CONC_LIST from the workflow is space-separated; bench loops one run per value. 
+read -r -a CONC_LIST_ARR <<< "$CONC_LIST" + +for conc in "${CONC_LIST_ARR[@]}"; do + if [[ "$DISAGG" == "true" ]]; then + result_filename="results_concurrency_${conc}_gpus_${TOTAL_GPUS}_ctx_${PREFILL_GPUS}_gen_${DECODE_GPUS}" + else + result_filename="results_concurrency_${conc}_gpus_${TOTAL_GPUS}" + fi + echo "=== conc=$conc → $RESULT_DIR/${result_filename}.json ===" + + args=( + --model "$MODEL_NAME" + --tokenizer /model + --port "$PORT" + --backend "$BACKEND" + --input-len "$ISL" + --output-len "$OSL" + --random-range-ratio "$RANDOM_RANGE_RATIO" + --num-prompts "$((conc * NUM_PROMPTS_MULT))" + --max-concurrency "$conc" + --result-filename "$result_filename" + --result-dir "$RESULT_DIR" + --bench-serving-dir "$INFMAX_WS" + ) + [[ -n "$ENDPOINT" ]] && args+=(--endpoint "$ENDPOINT") + [[ "$USE_CHAT_TEMPLATE" == "true" ]] && args+=(--use-chat-template) + [[ "$DSV4" == "true" ]] && args+=(--dsv4) + [[ "$TRUST_REMOTE_CODE" == "true" ]] && args+=(--trust-remote-code) + + run_benchmark_serving "${args[@]}" +done + +echo "Done. Results in $RESULT_DIR." diff --git a/runners/launch_b200-cw.sh b/runners/launch_b200-cw.sh index 0b2dbf305..fbdd60554 100644 --- a/runners/launch_b200-cw.sh +++ b/runners/launch_b200-cw.sh @@ -1,5 +1,7 @@ #!/usr/bin/env bash +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + export HF_HUB_CACHE_MOUNT="/tmp/gharunner/hf-hub-cache" export PORT=8888 @@ -16,7 +18,7 @@ if [[ ! 
-f "$BENCH_SCRIPT" ]]; then fi PARTITION="b200" -SQUASH_FILE="/tmp/gharunner/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" +SQUASH_FILE="/tmp/gharunner/squash/$(sanitize_image_filename "$IMAGE").sqsh" LOCK_FILE="${SQUASH_FILE}.lock" # TODO(Cam): lmsysorg/sglang:deepseek-v4-blackwell installs sglang editable at diff --git a/runners/launch_b200-dgxc.sh b/runners/launch_b200-dgxc.sh index edf5db957..3e294f859 100644 --- a/runners/launch_b200-dgxc.sh +++ b/runners/launch_b200-dgxc.sh @@ -4,6 +4,8 @@ SLURM_PARTITION="gpu" SLURM_ACCOUNT="benchmark" +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + set -x if [[ "$IS_MULTINODE" == "true" ]]; then @@ -29,35 +31,14 @@ if [[ "$IS_MULTINODE" == "true" ]]; then fi export SERVED_MODEL_NAME=$MODEL - echo "Cloning srt-slurm repository..." - SRT_REPO_DIR="srt-slurm" - if [ -d "$SRT_REPO_DIR" ]; then - echo "Removing existing $SRT_REPO_DIR..." - rm -rf "$SRT_REPO_DIR" - fi - - git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" - cd "$SRT_REPO_DIR" || exit 1 - git checkout sa-submission-q2-2026 - - echo "Installing srtctl..." - export UV_INSTALL_DIR="$GITHUB_WORKSPACE/.local/bin" - curl -LsSf https://astral.sh/uv/install.sh | sh - export PATH="$UV_INSTALL_DIR:$PATH" - - uv venv "$GITHUB_WORKSPACE/.venv" - source "$GITHUB_WORKSPACE/.venv/bin/activate" - uv pip install -e . - - if ! 
command -v srtctl &> /dev/null; then - echo "Error: Failed to install srtctl" - exit 1 - fi + UV_INSTALL_DIR="$GITHUB_WORKSPACE/.local/bin" \ + UV_VENV_DIR="$GITHUB_WORKSPACE/.venv" \ + clone_and_install_srtctl || exit 1 # Map container images to local squash files NGINX_IMAGE="nginx:1.27.4" - SQUASH_FILE="/home/sa-shared/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" - NGINX_SQUASH_FILE="/home/sa-shared/containers/$(echo "$NGINX_IMAGE" | sed 's/[\/:@#]/_/g').sqsh" + SQUASH_FILE="/home/sa-shared/containers/$(sanitize_image_filename "$IMAGE").sqsh" + NGINX_SQUASH_FILE="/home/sa-shared/containers/$(sanitize_image_filename "$NGINX_IMAGE").sqsh" # Import containers via enroot enroot import -o $SQUASH_FILE docker://$IMAGE @@ -105,7 +86,7 @@ EOF echo "Submitting job with srtctl..." if [[ -z "$CONFIG_FILE" ]]; then - echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a CONFIG_FILE in additional-settings." >&2 + echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a 'recipe:' field on the search-space entry (resolved by benchmark-multinode-tmpl.yml)." >&2 echo "Config: MODEL_PREFIX=${MODEL_PREFIX} PRECISION=${PRECISION} FRAMEWORK=${FRAMEWORK}" >&2 exit 1 fi @@ -250,7 +231,7 @@ EOF else HF_HUB_CACHE_MOUNT="/scratch/fsw/gharunners/hf-hub-cache" - SQUASH_FILE="/home/sa-shared/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" + SQUASH_FILE="/home/sa-shared/containers/$(sanitize_image_filename "$IMAGE").sqsh" FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '') SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '') # Prefer a framework-tagged script (e.g. 
dsv4_fp4_b200_vllm.sh) so models diff --git a/runners/launch_b300-nv.sh b/runners/launch_b300-nv.sh index 3c855e805..23f75ac80 100644 --- a/runners/launch_b300-nv.sh +++ b/runners/launch_b300-nv.sh @@ -4,6 +4,8 @@ SLURM_PARTITION="batch_1" SLURM_ACCOUNT="benchmark" +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + set -x if [[ "$IS_MULTINODE" == "true" ]]; then @@ -30,35 +32,14 @@ else exit 1 fi -echo "Cloning srt-slurm repository..." -SRT_REPO_DIR="srt-slurm" -if [ -d "$SRT_REPO_DIR" ]; then - echo "Removing existing $SRT_REPO_DIR..." - rm -rf "$SRT_REPO_DIR" -fi - -git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" -cd "$SRT_REPO_DIR" || exit 1 -git checkout sa-submission-q2-2026 - -echo "Installing srtctl..." -export UV_INSTALL_DIR="$GITHUB_WORKSPACE/.local/bin" -curl -LsSf https://astral.sh/uv/install.sh | sh -export PATH="$UV_INSTALL_DIR:$PATH" - -uv venv "$GITHUB_WORKSPACE/.venv" -source "$GITHUB_WORKSPACE/.venv/bin/activate" -uv pip install -e . - -if ! command -v srtctl &> /dev/null; then - echo "Error: Failed to install srtctl" - exit 1 -fi +UV_INSTALL_DIR="$GITHUB_WORKSPACE/.local/bin" \ +UV_VENV_DIR="$GITHUB_WORKSPACE/.venv" \ + clone_and_install_srtctl || exit 1 # Map container images to local squash files NGINX_IMAGE="nginx:1.27.4" -SQUASH_FILE="/data/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" -NGINX_SQUASH_FILE="/data/squash/$(echo "$NGINX_IMAGE" | sed 's/[\/:@#]/_/g').sqsh" +SQUASH_FILE="/data/squash/$(sanitize_image_filename "$IMAGE").sqsh" +NGINX_SQUASH_FILE="/data/squash/$(sanitize_image_filename "$NGINX_IMAGE").sqsh" # Import containers via enroot srun -N 1 -A $SLURM_ACCOUNT -p $SLURM_PARTITION bash -c "enroot import -o $SQUASH_FILE docker://$IMAGE" @@ -108,7 +89,7 @@ export INFMAX_WORKSPACE="$GITHUB_WORKSPACE" echo "Submitting job with srtctl..." if [[ -z "$CONFIG_FILE" ]]; then - echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a CONFIG_FILE in additional-settings." 
>&2 + echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a 'recipe:' field on the search-space entry (resolved by benchmark-multinode-tmpl.yml)." >&2 echo "Config: MODEL_PREFIX=${MODEL_PREFIX} PRECISION=${PRECISION} FRAMEWORK=${FRAMEWORK}" >&2 exit 1 fi @@ -258,7 +239,7 @@ else elif [[ "$MODEL_PREFIX" == "dsv4" ]]; then export MODEL="$HF_HUB_CACHE_MOUNT/dsv4-pro" fi - SQUASH_FILE="/data/home/sa-shared/gharunners/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" + SQUASH_FILE="/data/home/sa-shared/gharunners/squash/$(sanitize_image_filename "$IMAGE").sqsh" SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '') # Prefer a framework-tagged script (e.g. dsv4_fp4_b300_sglang.sh) so models # with multiple inference engines can coexist; fall back to the historical diff --git a/runners/launch_gb200-nv.sh b/runners/launch_gb200-nv.sh index 333e94359..c8c822c6f 100755 --- a/runners/launch_gb200-nv.sh +++ b/runners/launch_gb200-nv.sh @@ -2,6 +2,8 @@ # This script sets up the environment and launches multi-node benchmarks +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + set -x # MODEL_PATH: Override with pre-downloaded paths on GB200 runner @@ -62,8 +64,8 @@ export SLURM_ACCOUNT="benchmark" NGINX_IMAGE="nginx:1.27.4" -SQUASH_FILE="/mnt/lustre01/users-public/sa-shared/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" -NGINX_SQUASH_FILE="/mnt/lustre01/users-public/sa-shared/$(echo "$NGINX_IMAGE" | sed 's/[\/:@#]/_/g').sqsh" +SQUASH_FILE="/mnt/lustre01/users-public/sa-shared/$(sanitize_image_filename "$IMAGE").sqsh" +NGINX_SQUASH_FILE="/mnt/lustre01/users-public/sa-shared/$(sanitize_image_filename "$NGINX_IMAGE").sqsh" enroot import -o $SQUASH_FILE docker://$IMAGE enroot import -o $NGINX_SQUASH_FILE docker://$NGINX_IMAGE @@ -125,57 +127,19 @@ PY fi -# srt-slurm path requires a CONFIG_FILE pointing to a recipe YAML. -# Without it, srtctl apply scans every YAML in the repo and submits hundreds of jobs. 
+# srt-slurm path requires CONFIG_FILE (set by benchmark-multinode-tmpl.yml from +# the search-space `recipe:` field). Without it, srtctl apply scans every YAML +# in the repo and submits hundreds of jobs. if [[ -z "$CONFIG_FILE" ]]; then - echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a CONFIG_FILE in additional-settings." >&2 + echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a 'recipe:' field on the search-space entry (resolved by benchmark-multinode-tmpl.yml)." >&2 echo "Config: MODEL_PREFIX=${MODEL_PREFIX} PRECISION=${PRECISION} FRAMEWORK=${FRAMEWORK}" >&2 exit 1 fi -echo "Cloning srt-slurm repository..." -SRT_REPO_DIR="srt-slurm" -if [ -d "$SRT_REPO_DIR" ]; then - echo "Removing existing $SRT_REPO_DIR..." - rm -rf "$SRT_REPO_DIR" -fi - -if [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "dsv4" ]]; then - git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" - cd "$SRT_REPO_DIR" - git checkout aflowers/vllm-gb200-v0.20.0 - # Use `cp -rT` so if the upstream branch ever ships a stub - # `recipes/vllm/deepseek-v4/` directory, we overlay our recipes onto - # it rather than nesting (`cp -r src dst` would create - # `recipes/vllm/deepseek-v4/deepseek-v4/...` in that case). - mkdir -p recipes/vllm/deepseek-v4 - cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4" recipes/vllm/deepseek-v4 -elif [[ $FRAMEWORK == "dynamo-vllm" ]]; then - git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" - cd "$SRT_REPO_DIR" - git checkout sa-submission-q2-2026 -elif [[ $FRAMEWORK == "dynamo-trt" && $MODEL_PREFIX == "kimik2.5" ]]; then - git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" - cd "$SRT_REPO_DIR" - git checkout sa-submission-q2-2026 -else - git clone https://github.com/ishandhanani/srt-slurm.git "$SRT_REPO_DIR" - cd "$SRT_REPO_DIR" - git checkout sa-submission-q1-2026 -fi - -echo "Installing srtctl..." 
-curl -LsSf https://astral.sh/uv/install.sh | sh -source $HOME/.local/bin/env - -uv venv -source .venv/bin/activate -uv pip install -e . - -if ! command -v srtctl &> /dev/null; then - echo "Error: Failed to install srtctl" - exit 1 -fi +# We only clone srt-slurm to install srtctl + pick up its sibling configs +# (configs/, expert-distributions/, etc). The recipe itself is supplied as an +# absolute CONFIG_FILE pointing at benchmarks/multi_node/srt-slurm-recipes/. +clone_and_install_srtctl || exit 1 echo "Configs available at: $SRT_REPO_DIR/" diff --git a/runners/launch_gb300-nv.sh b/runners/launch_gb300-nv.sh index 5f48ddcec..a0790260e 100644 --- a/runners/launch_gb300-nv.sh +++ b/runners/launch_gb300-nv.sh @@ -2,6 +2,8 @@ # This script sets up the environment and launches multi-node benchmarks +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + set -x export SLURM_PARTITION="batch" @@ -25,8 +27,8 @@ fi NGINX_IMAGE="nginx:1.27.4" -SQUASH_FILE="/home/sa-shared/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" -NGINX_SQUASH_FILE="/home/sa-shared/squash/$(echo "$NGINX_IMAGE" | sed 's/[\/:@#]/_/g').sqsh" +SQUASH_FILE="/home/sa-shared/squash/$(sanitize_image_filename "$IMAGE").sqsh" +NGINX_SQUASH_FILE="/home/sa-shared/squash/$(sanitize_image_filename "$NGINX_IMAGE").sqsh" srun --partition=$SLURM_PARTITION --exclusive --time=180 bash -c "enroot import -o $SQUASH_FILE docker://$IMAGE" srun --partition=$SLURM_PARTITION --exclusive --time=180 bash -c "enroot import -o $NGINX_SQUASH_FILE docker://$NGINX_IMAGE" @@ -36,30 +38,9 @@ export EVAL_ONLY="${EVAL_ONLY:-false}" export ISL="$ISL" export OSL="$OSL" -echo "Cloning srt-slurm repository..." -SRT_REPO_DIR="srt-slurm" -if [ -d "$SRT_REPO_DIR" ]; then - echo "Removing existing $SRT_REPO_DIR..." - rm -rf "$SRT_REPO_DIR" -fi - -git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" -cd "$SRT_REPO_DIR" -git checkout sa-submission-q2-2026 - -echo "Installing srtctl..." 
-export UV_INSTALL_DIR="$GITHUB_WORKSPACE/.local/bin" -curl -LsSf https://astral.sh/uv/install.sh | sh -export PATH="$UV_INSTALL_DIR:$PATH" - -uv venv "$GITHUB_WORKSPACE/.venv" -source "$GITHUB_WORKSPACE/.venv/bin/activate" -uv pip install -e . - -if ! command -v srtctl &> /dev/null; then - echo "Error: Failed to install srtctl" - exit 1 -fi +UV_INSTALL_DIR="$GITHUB_WORKSPACE/.local/bin" \ +UV_VENV_DIR="$GITHUB_WORKSPACE/.venv" \ + clone_and_install_srtctl || exit 1 echo "Configs available at: $SRT_REPO_DIR/" @@ -103,7 +84,7 @@ export INFMAX_WORKSPACE="$GITHUB_WORKSPACE" echo "Submitting job with srtctl..." if [[ -z "$CONFIG_FILE" ]]; then - echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a CONFIG_FILE in additional-settings." >&2 + echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a 'recipe:' field on the search-space entry (resolved by benchmark-multinode-tmpl.yml)." >&2 echo "Config: MODEL_PREFIX=${MODEL_PREFIX} PRECISION=${PRECISION} FRAMEWORK=${FRAMEWORK}" >&2 exit 1 fi diff --git a/runners/launch_h100-cw.sh b/runners/launch_h100-cw.sh index f3198ca8c..e036e6219 100644 --- a/runners/launch_h100-cw.sh +++ b/runners/launch_h100-cw.sh @@ -1,8 +1,10 @@ #!/usr/bin/env bash +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + export HF_HUB_CACHE_MOUNT="/mnt/vast/gharunner/hf-hub-cache" PARTITION="h100" -SQUASH_FILE="/mnt/vast/gharunner/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" +SQUASH_FILE="/mnt/vast/gharunner/squash/$(sanitize_image_filename "$IMAGE").sqsh" LOCK_FILE="${SQUASH_FILE}.lock" set -x diff --git a/runners/launch_h100-dgxc-slurm.sh b/runners/launch_h100-dgxc-slurm.sh index 5a2ab64d2..f95816448 100644 --- a/runners/launch_h100-dgxc-slurm.sh +++ b/runners/launch_h100-dgxc-slurm.sh @@ -5,6 +5,8 @@ SLURM_PARTITION="hpc-gpu-1" SLURM_ACCOUNT="customer" SLURM_EXCLUDED_NODELIST="hpc-gpu-1-7" +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + set -x if [[ "$IS_MULTINODE" == "true" ]]; then @@ -34,36 +36,13 @@ 
if [[ "$IS_MULTINODE" == "true" ]]; then exit 1 fi - echo "Cloning srt-slurm repository..." - SRT_REPO_DIR="srt-slurm" - if [ -d "$SRT_REPO_DIR" ]; then - echo "Removing existing $SRT_REPO_DIR..." - rm -rf "$SRT_REPO_DIR" - fi - - git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" - cd "$SRT_REPO_DIR" - git checkout sa-submission-q2-2026 - - echo "Installing srtctl..." - export UV_INSTALL_DIR="/mnt/nfs/sa-shared/.uv/bin" + # Pin uv state onto the NFS-shared volume so cluster nodes share a single + # cached install, and so the binary persists across runner workspaces. export UV_CACHE_DIR="/mnt/nfs/sa-shared/.uv/cache" export UV_PYTHON_INSTALL_DIR="/mnt/nfs/sa-shared/.uv/python" - mkdir -p "$UV_INSTALL_DIR" "$UV_CACHE_DIR" "$UV_PYTHON_INSTALL_DIR" - if ! [ -x "$UV_INSTALL_DIR/uv" ]; then - curl -LsSf https://astral.sh/uv/install.sh | sh - fi - export PATH="$UV_INSTALL_DIR:$PATH" - source $UV_INSTALL_DIR/env - - uv venv - source .venv/bin/activate - uv pip install -e . - - if ! command -v srtctl &> /dev/null; then - echo "Error: Failed to install srtctl" - exit 1 - fi + mkdir -p "$UV_CACHE_DIR" "$UV_PYTHON_INSTALL_DIR" + UV_INSTALL_DIR="/mnt/nfs/sa-shared/.uv/bin" \ + clone_and_install_srtctl || exit 1 echo "Configs available at: $SRT_REPO_DIR/" @@ -77,7 +56,7 @@ if [[ "$IS_MULTINODE" == "true" ]]; then elif [[ $FRAMEWORK == "dynamo-trt" ]]; then # TRT-LLM container mapping - convert IMAGE to srt-slurm format (nvcr.io/ -> nvcr.io#) CONTAINER_KEY=$(echo "$IMAGE" | sed 's|nvcr.io/|nvcr.io#|') - SQUASH_FILE="/mnt/nfs/sa-shared/containers/$(echo "$IMAGE" | sed 's|nvcr.io/||' | sed 's/[\/:@#]/+/g').sqsh" + SQUASH_FILE="/mnt/nfs/sa-shared/containers/$(sanitize_image_filename "${IMAGE#nvcr.io/}" +).sqsh" fi export ISL="$ISL" @@ -126,7 +105,7 @@ EOF echo "Submitting job with srtctl..." if [[ -z "$CONFIG_FILE" ]]; then - echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a CONFIG_FILE in additional-settings." 
>&2 + echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a 'recipe:' field on the search-space entry (resolved by benchmark-multinode-tmpl.yml)." >&2 echo "Config: MODEL_PREFIX=${MODEL_PREFIX} PRECISION=${PRECISION} FRAMEWORK=${FRAMEWORK}" >&2 exit 1 fi @@ -270,7 +249,7 @@ EOF else HF_HUB_CACHE_MOUNT="/mnt/nfs/sa-shared/gharunners/hf-hub-cache/" - SQUASH_FILE="/mnt/nfs/lustre/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" + SQUASH_FILE="/mnt/nfs/lustre/containers/$(sanitize_image_filename "$IMAGE").sqsh" salloc --exclude="$SLURM_EXCLUDED_NODELIST" --partition=$SLURM_PARTITION --account=$SLURM_ACCOUNT --gres=gpu:$TP --exclusive --time=180 --no-shell --job-name="$RUNNER_NAME" JOB_ID=$(squeue --name="$RUNNER_NAME" -u "$USER" -h -o %A | head -n1) diff --git a/runners/launch_h200-cw.sh b/runners/launch_h200-cw.sh index 84b40480c..08bbbc757 100644 --- a/runners/launch_h200-cw.sh +++ b/runners/launch_h200-cw.sh @@ -1,5 +1,7 @@ #!/usr/bin/env bash +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + export HF_HUB_CACHE_MOUNT="/mnt/vast/gharunner/hf-hub-cache" export PORT=8888 @@ -8,7 +10,7 @@ FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '') SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '') PARTITION="h200" -SQUASH_FILE="/mnt/vast/gharunner/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" +SQUASH_FILE="/mnt/vast/gharunner/squash/$(sanitize_image_filename "$IMAGE").sqsh" LOCK_FILE="${SQUASH_FILE}.lock" set -x diff --git a/runners/launch_h200-dgxc-slurm.sh b/runners/launch_h200-dgxc-slurm.sh index e11ca7b20..71a64025f 100755 --- a/runners/launch_h200-dgxc-slurm.sh +++ b/runners/launch_h200-dgxc-slurm.sh @@ -4,6 +4,8 @@ SLURM_PARTITION="main" SLURM_ACCOUNT="sa-shared" +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + set -x if [[ "$IS_MULTINODE" == "true" ]]; then @@ -33,29 +35,7 @@ if [[ "$IS_MULTINODE" == "true" ]]; then exit 1 fi - echo "Cloning srt-slurm repository..." 
- SRT_REPO_DIR="srt-slurm" - if [ -d "$SRT_REPO_DIR" ]; then - echo "Removing existing $SRT_REPO_DIR..." - rm -rf "$SRT_REPO_DIR" - fi - - git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" - cd "$SRT_REPO_DIR" - git checkout sa-submission-q2-2026 - - echo "Installing srtctl..." - curl -LsSf https://astral.sh/uv/install.sh | sh - source $HOME/.local/bin/env - - uv venv - source .venv/bin/activate - uv pip install -e . - - if ! command -v srtctl &> /dev/null; then - echo "Error: Failed to install srtctl" - exit 1 - fi + clone_and_install_srtctl || exit 1 echo "Configs available at: $SRT_REPO_DIR/" @@ -64,12 +44,12 @@ if [[ "$IS_MULTINODE" == "true" ]]; then if [[ $FRAMEWORK == "dynamo-sglang" ]]; then # SGLang container mapping - SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/+/g').sqsh" + SQUASH_FILE="/data/containers/$(sanitize_image_filename "$IMAGE" +).sqsh" CONTAINER_KEY="$IMAGE" elif [[ $FRAMEWORK == "dynamo-trt" ]]; then # TRT-LLM container mapping - convert IMAGE to srt-slurm format (nvcr.io/ -> nvcr.io#) CONTAINER_KEY=$(echo "$IMAGE" | sed 's|nvcr.io/|nvcr.io#|') - SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's|nvcr.io/||' | sed 's/[\/:@#]/+/g').sqsh" + SQUASH_FILE="/data/containers/$(sanitize_image_filename "${IMAGE#nvcr.io/}" +).sqsh" fi export ISL="$ISL" @@ -119,7 +99,7 @@ EOF echo "Submitting job with srtctl..." if [[ -z "$CONFIG_FILE" ]]; then - echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a CONFIG_FILE in additional-settings." >&2 + echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a 'recipe:' field on the search-space entry (resolved by benchmark-multinode-tmpl.yml)." 
>&2 echo "Config: MODEL_PREFIX=${MODEL_PREFIX} PRECISION=${PRECISION} FRAMEWORK=${FRAMEWORK}" >&2 exit 1 fi @@ -262,7 +242,7 @@ EOF else HF_HUB_CACHE_MOUNT="/models/gharunners/hf-hub-cache" - SQUASH_FILE="/data/gharunners/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" + SQUASH_FILE="/data/gharunners/containers/$(sanitize_image_filename "$IMAGE").sqsh" # Convert pyxis image format (nvcr.io#path) to docker format (nvcr.io/path) for enroot import DOCKER_IMAGE=$(echo "$IMAGE" | sed 's/#/\//g') diff --git a/runners/launch_h200-nb.sh b/runners/launch_h200-nb.sh index 9d157a858..849f73699 100644 --- a/runners/launch_h200-nb.sh +++ b/runners/launch_h200-nb.sh @@ -1,5 +1,7 @@ #!/usr/bin/bash +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + export HF_HUB_CACHE_MOUNT="/mnt/data/gharunners/hf-hub-cache/" export PORT=8888 @@ -12,7 +14,7 @@ PARTITION="main" set -x srun --partition=$PARTITION --gres=gpu:$TP --exclusive --job-name="$RUNNER_NAME" \ --container-image=$IMAGE \ ---container-name=$(echo "$IMAGE" | sed 's/[\/:@#]/_/g')-${USER} \ +--container-name=$(sanitize_image_filename "$IMAGE")-${USER} \ --container-mounts=$GITHUB_WORKSPACE:/workspace/,$HF_HUB_CACHE_MOUNT:$HF_HUB_CACHE \ --container-remap-root \ --container-writable \ diff --git a/runners/launch_mi300x-amds.sh b/runners/launch_mi300x-amds.sh index b654c515a..da98f3015 100644 --- a/runners/launch_mi300x-amds.sh +++ b/runners/launch_mi300x-amds.sh @@ -1,10 +1,12 @@ #!/usr/bin/env bash +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + export HF_HUB_CACHE_MOUNT="/raid/hf-hub-cache/" export PORT=8888 PARTITION="compute" -SQUASH_FILE="/home/gharunner/gharunners/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" +SQUASH_FILE="/home/gharunner/gharunners/squash/$(sanitize_image_filename "$IMAGE").sqsh" LOCK_FILE="${SQUASH_FILE}.lock" set -x diff --git a/runners/launch_mi325x-amds.sh b/runners/launch_mi325x-amds.sh index 67f93a309..200b46838 100644 --- a/runners/launch_mi325x-amds.sh +++ 
b/runners/launch_mi325x-amds.sh @@ -1,10 +1,12 @@ #!/usr/bin/env bash +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + export HF_HUB_CACHE_MOUNT="/nfsdata/sa/gharunner/gharunners/hf-hub-cache/" export PORT=8888 PARTITION="compute" -SQUASH_FILE="/nfsdata/sa/gharunner/gharunners/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" +SQUASH_FILE="/nfsdata/sa/gharunner/gharunners/squash/$(sanitize_image_filename "$IMAGE").sqsh" LOCK_FILE="${SQUASH_FILE}.lock" set -x diff --git a/runners/launch_mi355x-amds.sh b/runners/launch_mi355x-amds.sh index 152745d4e..a14cfdb2c 100644 --- a/runners/launch_mi355x-amds.sh +++ b/runners/launch_mi355x-amds.sh @@ -1,5 +1,7 @@ #!/usr/bin/env bash +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + scancel_sync() { local jobid=$1 local timeout=${2:-600} @@ -182,7 +184,7 @@ else SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '') PARTITION="compute" - SQUASH_FILE="/var/lib/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" + SQUASH_FILE="/var/lib/squash/$(sanitize_image_filename "$IMAGE").sqsh" LOCK_FILE="${SQUASH_FILE}.lock" set -x diff --git a/utils/matrix_logic/generate_sweep_configs.py b/utils/matrix_logic/generate_sweep_configs.py index e543bb4af..44613e8eb 100644 --- a/utils/matrix_logic/generate_sweep_configs.py +++ b/utils/matrix_logic/generate_sweep_configs.py @@ -267,6 +267,8 @@ def generate_full_sweep(args, all_config_data, runner_data): seq_len_str = seq_len_to_str(isl, osl) runners_for_entry = runner_nodes_to_use if runner_nodes_to_use else [runner] + recipe = bmk.get(Fields.RECIPE.value) + for runner_value in runners_for_entry: entry = { Fields.IMAGE.value: image, @@ -285,6 +287,7 @@ def generate_full_sweep(args, all_config_data, runner_data): Fields.EXP_NAME.value: f"{model_code}_{seq_len_str}", Fields.DISAGG.value: disagg, Fields.RUN_EVAL.value: False, # Default, may be overridden by mark_eval_entries + Fields.RECIPE.value: recipe, } validate_matrix_entry(entry, is_multinode) @@ 
-463,6 +466,7 @@ def get_lowest_conc(search_space_entry): Fields.SPEC_DECODING.value, "none") prefill_config = lowest_conc_entry[Fields.PREFILL.value] decode_config = lowest_conc_entry[Fields.DECODE.value] + recipe = lowest_conc_entry.get(Fields.RECIPE.value) for node in runner_nodes: entry = { @@ -494,6 +498,7 @@ def get_lowest_conc(search_space_entry): Fields.EXP_NAME.value: f"{model_code}_test", Fields.DISAGG.value: disagg, Fields.RUN_EVAL.value: False, + Fields.RECIPE.value: recipe, } matrix_values.append(validate_matrix_entry(entry, is_multinode=True)) else: @@ -620,6 +625,7 @@ def generate_test_config_sweep(args, all_config_data): Fields.EXP_NAME.value: f"{model_code}_{seq_len_str}", Fields.DISAGG.value: disagg, Fields.RUN_EVAL.value: False, + Fields.RECIPE.value: bmk.get(Fields.RECIPE.value), } matrix_values.append(validate_matrix_entry(entry, is_multinode=True)) else: diff --git a/utils/matrix_logic/validation.py b/utils/matrix_logic/validation.py index ce10840b5..7f1fa3326 100644 --- a/utils/matrix_logic/validation.py +++ b/utils/matrix_logic/validation.py @@ -1,3 +1,5 @@ +from pathlib import Path + from pydantic import BaseModel, Field, ValidationError, ConfigDict, model_validator from typing import List, Optional, Union, Literal from enum import Enum @@ -5,6 +7,11 @@ import pprint import yaml +# Repo-relative root for first-class srt-slurm recipes referenced by the +# `recipe:` field on multi-node search-space entries. Resolved against the +# repository root (parent of utils/) so callers can run from any cwd. +RECIPES_ROOT = Path(__file__).resolve().parents[2] / "benchmarks" / "multi_node" / "srt-slurm-recipes" + """ The below class defines the field names expected to be present in the JSON entries for both single-node and multi-node configurations. 
@@ -44,6 +51,7 @@ class Fields(Enum): BATCH_SIZE = 'batch-size' MAX_NUM_TOKENS = 'max-num-tokens' ADDITIONAL_SETTINGS = 'additional-settings' + RECIPE = 'recipe' # Matrix entry fields CONC = 'conc' @@ -131,6 +139,11 @@ class MultiNodeMatrixEntry(BaseModel): run_eval: bool = Field(alias=Fields.RUN_EVAL.value) eval_only: bool = Field(alias=Fields.EVAL_ONLY.value, default=False) eval_conc: Optional[int] = Field(default=None, alias=Fields.EVAL_CONC.value) + # Path under benchmarks/multi_node/srt-slurm-recipes/ identifying the + # srt-slurm recipe to dispatch. May carry an `:override[N]` suffix that the + # launcher strips before resolving the file on disk. Optional because not + # every multi-node config uses srt-slurm. + recipe: Optional[str] = None def validate_matrix_entry(entry: dict, is_multinode: bool) -> dict: @@ -234,11 +247,31 @@ class MultiNodeSearchSpaceEntry(BaseModel): default=None, alias=Fields.CONC_END.value) conc_list: Optional[List[int]] = Field( default=None, alias=Fields.CONC_LIST.value) + # First-class srt-slurm recipe reference. Path is relative to + # benchmarks/multi_node/srt-slurm-recipes/ and may carry an + # `:override[N]` suffix to select an in-yaml override section. + recipe: Optional[str] = None @model_validator(mode='after') def validate_conc_fields(self): return _validate_conc_fields(self) + @model_validator(mode='after') + def validate_recipe_exists(self): + if self.recipe is None: + return self + # Strip `:override[...]` suffix used by sglang-style recipes that + # carry multiple override sections in one file. + recipe_path = self.recipe.split(':', 1)[0] + full_path = RECIPES_ROOT / recipe_path + if not full_path.is_file(): + raise ValueError( + f"Recipe file not found: '{self.recipe}' " + f"(resolved to '{full_path}'). " + f"Recipes must live under benchmarks/multi_node/srt-slurm-recipes/." + ) + return self + class SingleNodeSeqLenConfig(BaseModel): """Single node sequence length configuration."""
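Appended for illustration: the launchers in this diff all swap inline `sed` pipelines for `sanitize_image_filename` from `benchmarks/benchmark_lib.sh`, whose definition is outside the patch. Below is a minimal sketch inferred purely from the call sites (default `_` separator; optional second argument such as `+` for pyxis-style names) — the real helper may differ.

```shell
#!/usr/bin/env bash
# Hypothetical reimplementation of sanitize_image_filename, inferred from the
# call sites in this diff; the actual definition lives in benchmark_lib.sh.
sanitize_image_filename() {
  local image="$1"
  local sep="${2:-_}"  # some launchers pass `+` for pyxis-style squash names
  # Mirror the inline pipelines this diff removes: sed 's/[\/:@#]/_/g'
  echo "$image" | sed "s/[\/:@#]/${sep}/g"
}

# e.g. builds the same squash-file stem the old sed one-liners produced:
sanitize_image_filename "nvcr.io/nvidia/tensorrt-llm:24.10"
sanitize_image_filename "lmsysorg/sglang:latest" +
```

Note the second argument reproduces the `+`-separated variants used on the dgxc/h200 paths (`sanitize_image_filename "${IMAGE#nvcr.io/}" +`), where the registry prefix is stripped by the caller before sanitizing.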