
[NVIDIA] [WIP] MTP for dsr1 TRT without chat template #412

Closed
ankursingh-nv wants to merge 23 commits into main from dsr1-trt-mtp-no-chat-template

Conversation

@ankursingh-nv
Contributor

@ankursingh-nv ankursingh-nv commented Jan 12, 2026

Note

Introduces MTP-enabled TRT configs and end-to-end launch support for DeepSeek-R1 on NVIDIA single-node.

  • Adds dsr1-fp4-b200-trt-mtp, dsr1-fp8-b200-trt-mtp, dsr1-fp8-h200-trt-mtp with spec-decoding: mtp, updated TP/EP/DP-attn search spaces, and notes for MTP sizing
  • New benchmark scripts: benchmarks/dsr1_fp4_b200_trt_mtp_slurm.sh, dsr1_fp8_b200_trt_mtp_slurm.sh, dsr1_fp8_h200_trt_mtp_slurm.sh configuring MOE backend, DP attention, and MTP server options
  • Runner scripts (runners/launch_*) detect SPEC_DECODING=mtp and select _mtp benchmark scripts via suffix
  • benchmarks/benchmark_lib.sh: add optional --use-chat-template flag and refactor to build the benchmark command array
  • Update perf-changelog.yaml to record new MTP TRT single-node configurations
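The command-array refactor with an optional flag described above could look roughly like the following. This is a hypothetical sketch; the function name, base command, and `PORT` variable are illustrative, not the actual `benchmark_lib.sh` contents.

```shell
#!/usr/bin/env bash
# Illustrative sketch of building a benchmark command as a bash array and
# appending --use-chat-template only when requested. All names here are
# assumptions, not the real benchmark_lib.sh implementation.
build_benchmark_cmd() {
    local use_chat_template="$1"
    # Assemble the base command as an array so arguments survive word splitting.
    CMD=(python -m benchmark_serving --port "${PORT:-8000}")
    # Append the optional flag only when the caller asks for it.
    if [ "$use_chat_template" = "true" ]; then
        CMD+=(--use-chat-template)
    fi
}

build_benchmark_cmd true
echo "${CMD[@]}"
```

Using an array (rather than string concatenation) keeps quoting intact when the command is eventually executed with `"${CMD[@]}"`.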

Written by Cursor Bugbot for commit 5f34e19.

lishicheng1996-nv and others added 21 commits January 7, 2026 10:42
- Add dsr1_fp4_b200_trt_mtp_slurm.sh with MTP support
- Add dsr1_fp8_b200_trt_mtp_slurm.sh with MTP support
- Add dsr1_fp8_h200_trt_mtp_slurm.sh with MTP support
- Refactored to use benchmark_lib.sh utilities
- Use wait_for_server_ready and run_benchmark_serving functions
- Extended benchmark_lib.sh run_benchmark_serving() to support optional --use-chat-template flag
- Added --use-chat-template to all three MTP benchmark scripts
- This is required for MTP mode to work correctly
- Add dsr1-fp4-b200-trt-mtp configuration with EP/DP_ATTN/MTP logic
- Add dsr1-fp8-b200-trt-mtp configuration with EP/DP_ATTN/MTP logic
- Add dsr1-fp8-h200-trt-mtp configuration with EP/DP_ATTN/MTP logic
- Configurations align with benchmark script logic for dynamic EP_SIZE, MOE_BACKEND, and MTP values
…vars

- Remove duplicate EP_SIZE/DP_ATTENTION calculation logic from MTP scripts
- MTP scripts now receive EP_SIZE and DP_ATTENTION as env vars from YAML config (like non-MTP scripts)
- Only calculate MOE_BACKEND and MTP values based on DP_ATTENTION flag
- Simplifies scripts from 156/117/112 lines to 104 lines each
- Eliminates redundant logic between YAML configs and bash scripts
- Fix dsr1-fp4-b200-trt-mtp conc ranges to match EP_SIZE conditions
- Fix dsr1-fp8-b200-trt-mtp conc ranges to match DP_ATTENTION conditions
- Fix dsr1-fp8-h200-trt-mtp conc ranges to match DP_ATTENTION conditions
- All configurations now accurately reflect the original bash script conditional logic
- Change conc-end from 64 to 32 for dsr1-fp8-b200-trt-mtp ISL=1024/OSL=1024
- Change conc-end from 128 to 64 for dsr1-fp8-b200-trt-mtp ISL=1024/OSL=8192
- Change conc-end from 64 to 32 for dsr1-fp8-b200-trt-mtp ISL=8192/OSL=1024
- Change conc-end from 256 to 128 for dsr1-fp8-h200-trt-mtp ISL=1024/OSL=8192
- Change conc-end from 64 to 32 for dsr1-fp8-h200-trt-mtp ISL=8192/OSL=1024
- All concurrency ranges now align to powers of 2: 4, 8, 16, 32, 64, 128, 256
- Remove overlapping boundaries between conc ranges
- Change ranges to avoid overlap: 4-8, 16-64, 128-256 (with gaps at 9-15, 65-127)
- All ranges now use powers of 2 boundaries without overlap
- Applies to all TP/ISL/OSL combinations in dsr1-fp4-b200-trt-mtp
fix perf-changelog file
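The non-overlapping power-of-two concurrency boundaries described in the commits above could be swept with a simple doubling loop. This is illustrative only; `CONC_START`/`CONC_END` are hypothetical variable names, not identifiers from the configs.

```shell
#!/usr/bin/env bash
# Illustrative: enumerate power-of-two concurrency values within a range,
# matching the 4, 8, 16, 32, 64, 128, 256 boundaries mentioned above.
CONC_START=4
CONC_END=256
conc=$CONC_START
while [ "$conc" -le "$CONC_END" ]; do
    echo "$conc"          # one benchmark run per concurrency level
    conc=$((conc * 2))
done
```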
@gemini-code-assist
Contributor

Summary of Changes

Hello @ankursingh-nv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the benchmarking capabilities for DeepSeek-R1 models by integrating Multi-Token Prediction (MTP) support within the TensorRT-LLM framework. It introduces new configuration profiles and specialized SLURM scripts for B200 and H200 GPUs, enabling performance evaluation across different precisions and operational parameters. The changes also refine the general benchmarking utility to accommodate diverse model requirements, such as optional chat template usage.

Highlights

  • Multi-Token Prediction (MTP) Benchmarking: New configurations and dedicated benchmark scripts have been added to support Multi-Token Prediction (MTP) for DeepSeek-R1 (dsr1) models using TensorRT-LLM.
  • Expanded GPU Support: MTP benchmarks are now integrated for both B200 and H200 GPUs, covering FP4 and FP8 precisions, with specific optimizations for different sequence lengths and concurrency levels.
  • Dynamic MoE Backend and MTP Layer Configuration: The new benchmark scripts dynamically determine the MoE backend (CUTLASS or DEEPGEMM) and the number of MTP layers (1 or 3) based on whether Data Parallel Attention (DP_ATTENTION) is enabled.
  • Benchmark Utility Enhancement: The core run_benchmark_serving utility has been updated to include an optional --use-chat-template flag, providing greater flexibility for various benchmarking scenarios.
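The dynamic backend/MTP selection described above could be sketched as a single conditional on the `DP_ATTENTION` flag. Note the specific pairing below (which backend and which MTP value go with each flag setting) is an assumption for illustration; the review only states that the scripts choose between CUTLASS/DEEPGEMM and 1/3 MTP layers based on this flag.

```shell
#!/usr/bin/env bash
# Sketch of the DP_ATTENTION-driven selection. The pairings marked "assumed"
# are illustrative, not confirmed by the PR text.
if [ "${DP_ATTENTION:-false}" = "true" ]; then
    MOE_BACKEND="DEEPGEMM"   # assumed pairing with DP attention enabled
    MTP=1                    # assumed MTP layer count for this branch
else
    MOE_BACKEND="CUTLASS"
    MTP=3
fi
echo "MOE_BACKEND=$MOE_BACKEND MTP=$MTP"
```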


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces MTP (speculative decoding) support for DeepSeek-R1 TRT on NVIDIA hardware. It adds new configurations, benchmark scripts, and updates runner scripts accordingly. The changes are generally well-structured, but there are a few areas for improvement in the new benchmark scripts to enhance robustness and prevent potential race conditions. I've also noted a change in the perf-changelog.yaml that appears to be unrelated to this PR's scope.

I am having trouble creating individual review comments, so my feedback is consolidated below.

benchmarks/dsr1_fp4_b200_trt_mtp_slurm.sh (35)

high

The script uses a hardcoded filename dsr1-fp4-mtp.yml for the temporary configuration. If multiple instances of this script run concurrently in the same directory, they could overwrite each other's configuration files, leading to race conditions and incorrect benchmark results. It's safer to use mktemp to create a unique temporary file.

Consider also adding a trap at the beginning of the script to ensure the temporary file is cleaned up on exit:
trap 'rm -f "$EXTRA_CONFIG_FILE"' EXIT

EXTRA_CONFIG_FILE=$(mktemp /tmp/dsr1-fp4-mtp.XXXXXX.yml)

benchmarks/dsr1_fp8_b200_trt_mtp_slurm.sh (35)

high

The script uses a hardcoded filename dsr1-fp8-mtp.yml. This filename is also used in benchmarks/dsr1_fp8_h200_trt_mtp_slurm.sh. If these scripts were ever to run in the same working directory, they would conflict. To prevent race conditions and ensure correctness, a unique temporary file should be created using mktemp.

Consider also adding a trap at the beginning of the script to ensure the temporary file is cleaned up on exit:
trap 'rm -f "$EXTRA_CONFIG_FILE"' EXIT

EXTRA_CONFIG_FILE=$(mktemp /tmp/dsr1-fp8-mtp.XXXXXX.yml)

benchmarks/dsr1_fp8_h200_trt_mtp_slurm.sh (35)

high

The script uses a hardcoded filename dsr1-fp8-mtp.yml. This filename is also used in benchmarks/dsr1_fp8_b200_trt_mtp_slurm.sh. If these scripts were ever to run in the same working directory, they would conflict. To prevent race conditions and ensure correctness, a unique temporary file should be created using mktemp.

Consider also adding a trap at the beginning of the script to ensure the temporary file is cleaned up on exit:
trap 'rm -f "$EXTRA_CONFIG_FILE"' EXIT

EXTRA_CONFIG_FILE=$(mktemp /tmp/dsr1-fp8-mtp.XXXXXX.yml)

benchmarks/dsr1_fp4_b200_trt_mtp_slurm.sh (46)

medium

There is a trailing space on this line. While it may not affect YAML parsing for a boolean value, it's best to remove it for cleanliness and to avoid potential issues with stricter parsers.

    enable_block_reuse: false

benchmarks/dsr1_fp4_b200_trt_mtp_slurm.sh (74-83)

medium

The mpirun command is missing the PYTHONNOUSERSITE=1 prefix, which is present in benchmarks/dsr1_fp8_h200_trt_mtp_slurm.sh. This variable prevents user-site packages from being used, which can help avoid environment-related issues and improve reproducibility. For consistency and robustness, it should be added here as well.

PYTHONNOUSERSITE=1 mpirun -n 1 --oversubscribe --allow-run-as-root \
    trtllm-serve $MODEL --port=$PORT \
    --trust_remote_code \
    --backend=pytorch \
    --max_batch_size=$MAX_BATCH_SIZE \
    --max_seq_len=$MAX_MODEL_LEN \
    --max_num_tokens=$MAX_NUM_TOKENS \
    --tp_size=$TP --ep_size=$EP_SIZE \
    --extra_llm_api_options=$EXTRA_CONFIG_FILE \
    > $SERVER_LOG 2>&1 &

benchmarks/dsr1_fp8_b200_trt_mtp_slurm.sh (46)

medium

There is a trailing space on this line. While it may not affect YAML parsing for a boolean value, it's best to remove it for cleanliness and to avoid potential issues with stricter parsers.

    enable_block_reuse: false

benchmarks/dsr1_fp8_b200_trt_mtp_slurm.sh (74-83)

medium

The mpirun command is missing the PYTHONNOUSERSITE=1 prefix, which is present in benchmarks/dsr1_fp8_h200_trt_mtp_slurm.sh. This variable prevents user-site packages from being used, which can help avoid environment-related issues and improve reproducibility. For consistency and robustness, it should be added here as well.

PYTHONNOUSERSITE=1 mpirun -n 1 --oversubscribe --allow-run-as-root \
    trtllm-serve $MODEL --port=$PORT \
    --trust_remote_code \
    --backend=pytorch \
    --max_batch_size=$MAX_BATCH_SIZE \
    --max_seq_len=$MAX_MODEL_LEN \
    --max_num_tokens=$MAX_NUM_TOKENS \
    --tp_size=$TP --ep_size=$EP_SIZE \
    --extra_llm_api_options=$EXTRA_CONFIG_FILE \
    > $SERVER_LOG 2>&1 &

benchmarks/dsr1_fp8_h200_trt_mtp_slurm.sh (51)

medium

There is a trailing space on this line. While it may not affect YAML parsing for a boolean value, it's best to remove it for cleanliness and to avoid potential issues with stricter parsers.

    enable_block_reuse: false

perf-changelog.yaml (160-164)

medium

This addition to the changelog appears to be unrelated to the current pull request, which is focused on adding MTP support for DSR1 TRT. The changelog entry refers to updating an SGLang image for MI355x and references PR #395. It's generally better to keep pull requests focused on a single logical change. Please consider moving this to a separate PR.

@ankursingh-nv ankursingh-nv deleted the dsr1-trt-mtp-no-chat-template branch January 12, 2026 21:22
@cquil11 cquil11 added the NVIDIA label Apr 8, 2026
@cquil11 cquil11 changed the title [WIP] MTP for dsr1 TRT without chat template [NVIDIA] [WIP] MTP for dsr1 TRT without chat template Apr 8, 2026

4 participants