
[NVIDIA] [WIP] MTP for dsr1 TRT without chat template #412

Closed
ankursingh-nv wants to merge 23 commits into main from dsr1-trt-mtp-no-chat-template

Conversation

@ankursingh-nv
Contributor

@ankursingh-nv ankursingh-nv commented Jan 12, 2026

Note

Introduces MTP-enabled TRT configs and end-to-end launch support for DeepSeek-R1 on NVIDIA single-node.

  • Adds dsr1-fp4-b200-trt-mtp, dsr1-fp8-b200-trt-mtp, dsr1-fp8-h200-trt-mtp with spec-decoding: mtp, updated TP/EP/DP-attn search spaces, and notes for MTP sizing
  • New benchmark scripts: benchmarks/dsr1_fp4_b200_trt_mtp_slurm.sh, dsr1_fp8_b200_trt_mtp_slurm.sh, dsr1_fp8_h200_trt_mtp_slurm.sh configuring MOE backend, DP attention, and MTP server options
  • Runner scripts (runners/launch_*) detect SPEC_DECODING=mtp and select _mtp benchmark scripts via suffix
  • benchmarks/benchmark_lib.sh: add optional --use-chat-template flag and refactor to build the benchmark command array
  • Update perf-changelog.yaml to record new MTP TRT single-node configurations
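The command-array refactor with an optional flag described above could look roughly like the following. This is a hypothetical sketch; the function name, base command, and `PORT` variable are illustrative, not the actual `benchmark_lib.sh` contents.

```shell
#!/usr/bin/env bash
# Illustrative sketch of building a benchmark command as a bash array and
# appending --use-chat-template only when requested. All names here are
# assumptions, not the real benchmark_lib.sh implementation.
build_benchmark_cmd() {
    local use_chat_template="$1"
    # Assemble the base command as an array so arguments survive word splitting.
    CMD=(python -m benchmark_serving --port "${PORT:-8000}")
    # Append the optional flag only when the caller asks for it.
    if [ "$use_chat_template" = "true" ]; then
        CMD+=(--use-chat-template)
    fi
}

build_benchmark_cmd true
echo "${CMD[@]}"
```

Using an array (rather than string concatenation) keeps quoting intact when the command is eventually executed with `"${CMD[@]}"`.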

Written by Cursor Bugbot for commit 5f34e19.

lishicheng1996-nv and others added 21 commits January 7, 2026 10:42
- Add dsr1_fp4_b200_trt_mtp_slurm.sh with MTP support
- Add dsr1_fp8_b200_trt_mtp_slurm.sh with MTP support
- Add dsr1_fp8_h200_trt_mtp_slurm.sh with MTP support
- Refactored to use benchmark_lib.sh utilities
- Use wait_for_server_ready and run_benchmark_serving functions
- Extended benchmark_lib.sh run_benchmark_serving() to support optional --use-chat-template flag
- Added --use-chat-template to all three MTP benchmark scripts
- This is required for MTP mode to work correctly
- Add dsr1-fp4-b200-trt-mtp configuration with EP/DP_ATTN/MTP logic
- Add dsr1-fp8-b200-trt-mtp configuration with EP/DP_ATTN/MTP logic
- Add dsr1-fp8-h200-trt-mtp configuration with EP/DP_ATTN/MTP logic
- Configurations align with benchmark script logic for dynamic EP_SIZE, MOE_BACKEND, and MTP values
…vars

- Remove duplicate EP_SIZE/DP_ATTENTION calculation logic from MTP scripts
- MTP scripts now receive EP_SIZE and DP_ATTENTION as env vars from YAML config (like non-MTP scripts)
- Only calculate MOE_BACKEND and MTP values based on DP_ATTENTION flag
- Simplifies scripts from 156/117/112 lines to 104 lines each
- Eliminates redundant logic between YAML configs and bash scripts
- Fix dsr1-fp4-b200-trt-mtp conc ranges to match EP_SIZE conditions
- Fix dsr1-fp8-b200-trt-mtp conc ranges to match DP_ATTENTION conditions
- Fix dsr1-fp8-h200-trt-mtp conc ranges to match DP_ATTENTION conditions
- All configurations now accurately reflect the original bash script conditional logic
- Change conc-end from 64 to 32 for dsr1-fp8-b200-trt-mtp ISL=1024/OSL=1024
- Change conc-end from 128 to 64 for dsr1-fp8-b200-trt-mtp ISL=1024/OSL=8192
- Change conc-end from 64 to 32 for dsr1-fp8-b200-trt-mtp ISL=8192/OSL=1024
- Change conc-end from 256 to 128 for dsr1-fp8-h200-trt-mtp ISL=1024/OSL=8192
- Change conc-end from 64 to 32 for dsr1-fp8-h200-trt-mtp ISL=8192/OSL=1024
- All concurrency ranges now align to powers of 2: 4, 8, 16, 32, 64, 128, 256
- Remove overlapping boundaries between conc ranges
- Change ranges to avoid overlap: 4-8, 16-64, 128-256 (with gaps at 9-15, 65-127)
- All ranges now use powers of 2 boundaries without overlap
- Applies to all TP/ISL/OSL combinations in dsr1-fp4-b200-trt-mtp
fix perf-changelog file
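The non-overlapping power-of-two concurrency boundaries described in the commits above could be swept with a simple doubling loop. This is illustrative only; `CONC_START`/`CONC_END` are hypothetical variable names, not identifiers from the configs.

```shell
#!/usr/bin/env bash
# Illustrative: enumerate power-of-two concurrency values within a range,
# matching the 4, 8, 16, 32, 64, 128, 256 boundaries mentioned above.
CONC_START=4
CONC_END=256
conc=$CONC_START
while [ "$conc" -le "$CONC_END" ]; do
    echo "$conc"          # one benchmark run per concurrency level
    conc=$((conc * 2))
done
```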
@gemini-code-assist
Contributor

Summary of Changes

Hello @ankursingh-nv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the benchmarking capabilities for DeepSeek-R1 models by integrating Multi-Token Prediction (MTP) support within the TensorRT-LLM framework. It introduces new configuration profiles and specialized SLURM scripts for B200 and H200 GPUs, enabling performance evaluation across different precisions and operational parameters. The changes also refine the general benchmarking utility to accommodate diverse model requirements, such as optional chat template usage.

Highlights

  • Multi-Token Prediction (MTP) Benchmarking: New configurations and dedicated benchmark scripts have been added to support Multi-Token Prediction (MTP) for DeepSeek-R1 (dsr1) models using TensorRT-LLM.
  • Expanded GPU Support: MTP benchmarks are now integrated for both B200 and H200 GPUs, covering FP4 and FP8 precisions, with specific optimizations for different sequence lengths and concurrency levels.
  • Dynamic MoE Backend and MTP Layer Configuration: The new benchmark scripts dynamically determine the MoE backend (CUTLASS or DEEPGEMM) and the number of MTP layers (1 or 3) based on whether Data Parallel Attention (DP_ATTENTION) is enabled.
  • Benchmark Utility Enhancement: The core run_benchmark_serving utility has been updated to include an optional --use-chat-template flag, providing greater flexibility for various benchmarking scenarios.
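The dynamic backend/MTP selection described above could be sketched as a single conditional on the `DP_ATTENTION` flag. Note the specific pairing below (which backend and which MTP value go with each flag setting) is an assumption for illustration; the review only states that the scripts choose between CUTLASS/DEEPGEMM and 1/3 MTP layers based on this flag.

```shell
#!/usr/bin/env bash
# Sketch of the DP_ATTENTION-driven selection. The pairings marked "assumed"
# are illustrative, not confirmed by the PR text.
if [ "${DP_ATTENTION:-false}" = "true" ]; then
    MOE_BACKEND="DEEPGEMM"   # assumed pairing with DP attention enabled
    MTP=1                    # assumed MTP layer count for this branch
else
    MOE_BACKEND="CUTLASS"
    MTP=3
fi
echo "MOE_BACKEND=$MOE_BACKEND MTP=$MTP"
```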


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces MTP (speculative decoding) support for DeepSeek-R1 TRT on NVIDIA hardware. It adds new configurations, benchmark scripts, and updates runner scripts accordingly. The changes are generally well-structured, but there are a few areas for improvement in the new benchmark scripts to enhance robustness and prevent potential race conditions. I've also noted a change in the perf-changelog.yaml that appears to be unrelated to this PR's scope.

I am having trouble creating individual review comments, so my feedback is consolidated below.

benchmarks/dsr1_fp4_b200_trt_mtp_slurm.sh (35)

high

The script uses a hardcoded filename dsr1-fp4-mtp.yml for the temporary configuration. If multiple instances of this script run concurrently in the same directory, they could overwrite each other's configuration files, leading to race conditions and incorrect benchmark results. It's safer to use mktemp to create a unique temporary file.

Consider also adding a trap at the beginning of the script to ensure the temporary file is cleaned up on exit:
trap 'rm -f "$EXTRA_CONFIG_FILE"' EXIT

EXTRA_CONFIG_FILE=$(mktemp /tmp/dsr1-fp4-mtp.XXXXXX.yml)

benchmarks/dsr1_fp8_b200_trt_mtp_slurm.sh (35)

high

The script uses a hardcoded filename dsr1-fp8-mtp.yml. This filename is also used in benchmarks/dsr1_fp8_h200_trt_mtp_slurm.sh. If these scripts were ever to run in the same working directory, they would conflict. To prevent race conditions and ensure correctness, a unique temporary file should be created using mktemp.

Consider also adding a trap at the beginning of the script to ensure the temporary file is cleaned up on exit:
trap 'rm -f "$EXTRA_CONFIG_FILE"' EXIT

EXTRA_CONFIG_FILE=$(mktemp /tmp/dsr1-fp8-mtp.XXXXXX.yml)

benchmarks/dsr1_fp8_h200_trt_mtp_slurm.sh (35)

high

The script uses a hardcoded filename dsr1-fp8-mtp.yml. This filename is also used in benchmarks/dsr1_fp8_b200_trt_mtp_slurm.sh. If these scripts were ever to run in the same working directory, they would conflict. To prevent race conditions and ensure correctness, a unique temporary file should be created using mktemp.

Consider also adding a trap at the beginning of the script to ensure the temporary file is cleaned up on exit:
trap 'rm -f "$EXTRA_CONFIG_FILE"' EXIT

EXTRA_CONFIG_FILE=$(mktemp /tmp/dsr1-fp8-mtp.XXXXXX.yml)

benchmarks/dsr1_fp4_b200_trt_mtp_slurm.sh (46)

medium

There is a trailing space on this line. While it may not affect YAML parsing for a boolean value, it's best to remove it for cleanliness and to avoid potential issues with stricter parsers.

    enable_block_reuse: false

benchmarks/dsr1_fp4_b200_trt_mtp_slurm.sh (74-83)

medium

The mpirun command is missing the PYTHONNOUSERSITE=1 prefix, which is present in benchmarks/dsr1_fp8_h200_trt_mtp_slurm.sh. This variable prevents user-site packages from being used, which can help avoid environment-related issues and improve reproducibility. For consistency and robustness, it should be added here as well.

PYTHONNOUSERSITE=1 mpirun -n 1 --oversubscribe --allow-run-as-root \
    trtllm-serve $MODEL --port=$PORT \
    --trust_remote_code \
    --backend=pytorch \
    --max_batch_size=$MAX_BATCH_SIZE \
    --max_seq_len=$MAX_MODEL_LEN \
    --max_num_tokens=$MAX_NUM_TOKENS \
    --tp_size=$TP --ep_size=$EP_SIZE \
    --extra_llm_api_options=$EXTRA_CONFIG_FILE \
    > $SERVER_LOG 2>&1 &

benchmarks/dsr1_fp8_b200_trt_mtp_slurm.sh (46)

medium

There is a trailing space on this line. While it may not affect YAML parsing for a boolean value, it's best to remove it for cleanliness and to avoid potential issues with stricter parsers.

    enable_block_reuse: false

benchmarks/dsr1_fp8_b200_trt_mtp_slurm.sh (74-83)

medium

The mpirun command is missing the PYTHONNOUSERSITE=1 prefix, which is present in benchmarks/dsr1_fp8_h200_trt_mtp_slurm.sh. This variable prevents user-site packages from being used, which can help avoid environment-related issues and improve reproducibility. For consistency and robustness, it should be added here as well.

PYTHONNOUSERSITE=1 mpirun -n 1 --oversubscribe --allow-run-as-root \
    trtllm-serve $MODEL --port=$PORT \
    --trust_remote_code \
    --backend=pytorch \
    --max_batch_size=$MAX_BATCH_SIZE \
    --max_seq_len=$MAX_MODEL_LEN \
    --max_num_tokens=$MAX_NUM_TOKENS \
    --tp_size=$TP --ep_size=$EP_SIZE \
    --extra_llm_api_options=$EXTRA_CONFIG_FILE \
    > $SERVER_LOG 2>&1 &

benchmarks/dsr1_fp8_h200_trt_mtp_slurm.sh (51)

medium

There is a trailing space on this line. While it may not affect YAML parsing for a boolean value, it's best to remove it for cleanliness and to avoid potential issues with stricter parsers.

    enable_block_reuse: false

perf-changelog.yaml (160-164)

medium

This addition to the changelog appears to be unrelated to the current pull request, which is focused on adding MTP support for DSR1 TRT. The changelog entry refers to updating an SGLang image for MI355x and references PR #395. It's generally better to keep pull requests focused on a single logical change. Please consider moving this to a separate PR.

@ankursingh-nv ankursingh-nv deleted the dsr1-trt-mtp-no-chat-template branch January 12, 2026 21:22
@cquil11 cquil11 added the NVIDIA label Apr 8, 2026
@cquil11 cquil11 changed the title [WIP] MTP for dsr1 TRT without chat template [NVIDIA] [WIP] MTP for dsr1 TRT without chat template Apr 8, 2026

4 participants