[NVIDIA] [WIP] MTP for dsr1 TRT without chat template #412
ankursingh-nv wants to merge 23 commits into main
Conversation
- Add dsr1_fp4_b200_trt_mtp_slurm.sh with MTP support
- Add dsr1_fp8_b200_trt_mtp_slurm.sh with MTP support
- Add dsr1_fp8_h200_trt_mtp_slurm.sh with MTP support
- Refactored to use benchmark_lib.sh utilities
- Use wait_for_server_ready and run_benchmark_serving functions
- Extended benchmark_lib.sh run_benchmark_serving() to support an optional --use-chat-template flag
- Added --use-chat-template to all three MTP benchmark scripts
- This is required for MTP mode to work correctly
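A sketch of how such an optional flag might be threaded through a bash helper, assuming the flag-parsing and command-array approach described above (the function name matches the PR; the inner `echo` stands in for the real benchmark invocation):

```shell
# Sketch: accept an optional --use-chat-template flag in a bash helper
# and forward it by building the benchmark command as an array, so that
# arguments with spaces survive quoting. Illustrative only.
run_benchmark_serving() {
  local use_chat_template=""
  local args=()
  for arg in "$@"; do
    case "$arg" in
      --use-chat-template) use_chat_template=1 ;;
      *) args+=("$arg") ;;
    esac
  done
  # `echo` is a placeholder for the actual benchmark command.
  local cmd=(echo benchmark_serving "${args[@]}")
  if [ -n "$use_chat_template" ]; then
    cmd+=(--use-chat-template)
  fi
  "${cmd[@]}"
}
```

Building the command as an array (rather than string concatenation) keeps conditional flags clean and avoids word-splitting surprises.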
- Add dsr1-fp4-b200-trt-mtp configuration with EP/DP_ATTN/MTP logic
- Add dsr1-fp8-b200-trt-mtp configuration with EP/DP_ATTN/MTP logic
- Add dsr1-fp8-h200-trt-mtp configuration with EP/DP_ATTN/MTP logic
- Configurations align with benchmark script logic for dynamic EP_SIZE, MOE_BACKEND, and MTP values
…vars
- Remove duplicate EP_SIZE/DP_ATTENTION calculation logic from MTP scripts
- MTP scripts now receive EP_SIZE and DP_ATTENTION as env vars from the YAML config (like non-MTP scripts)
- Only calculate MOE_BACKEND and MTP values based on the DP_ATTENTION flag
- Simplifies scripts from 156/117/112 lines to 104 lines each
- Eliminates redundant logic between YAML configs and bash scripts
- Fix dsr1-fp4-b200-trt-mtp conc ranges to match EP_SIZE conditions
- Fix dsr1-fp8-b200-trt-mtp conc ranges to match DP_ATTENTION conditions
- Fix dsr1-fp8-h200-trt-mtp conc ranges to match DP_ATTENTION conditions
- All configurations now accurately reflect the original bash script conditional logic
- Change conc-end from 64 to 32 for dsr1-fp8-b200-trt-mtp ISL=1024/OSL=1024
- Change conc-end from 128 to 64 for dsr1-fp8-b200-trt-mtp ISL=1024/OSL=8192
- Change conc-end from 64 to 32 for dsr1-fp8-b200-trt-mtp ISL=8192/OSL=1024
- Change conc-end from 256 to 128 for dsr1-fp8-h200-trt-mtp ISL=1024/OSL=8192
- Change conc-end from 64 to 32 for dsr1-fp8-h200-trt-mtp ISL=8192/OSL=1024
- All concurrency ranges now align to powers of 2: 4, 8, 16, 32, 64, 128, 256
- Remove overlapping boundaries between conc ranges
- Change ranges to avoid overlap: 4-8, 16-64, 128-256 (with gaps at 9-15, 65-127)
- All ranges now use powers-of-2 boundaries without overlap
- Applies to all TP/ISL/OSL combinations in dsr1-fp4-b200-trt-mtp
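The power-of-two sweep described in these commits can be enumerated with a simple doubling loop (a sketch; the actual configs spell out conc-start/conc-end per ISL/OSL combination in YAML):

```shell
# Sketch: list the power-of-two concurrency values in [start, end].
# A range such as 16-64 expands to 16 32 64; gaps like 9-15 are
# skipped automatically because the loop only visits powers of two.
conc_values() {
  local start=$1 end=$2 c out=""
  for ((c = start; c <= end; c *= 2)); do
    out+="$c "
  done
  echo "${out% }"
}
```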
fix perf-changelog file
Summary of Changes
Hello @ankursingh-nv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the benchmarking capabilities for DeepSeek-R1 models by integrating Multi-Token Prediction (MTP) support within the TensorRT-LLM framework. It introduces new configuration profiles and specialized SLURM scripts for B200 and H200 GPUs, enabling performance evaluation across different precisions and operational parameters. The changes also refine the general benchmarking utility to accommodate diverse model requirements, such as optional chat template usage.
Code Review
This pull request introduces MTP (speculative decoding) support for DeepSeek-R1 TRT on NVIDIA hardware. It adds new configurations, benchmark scripts, and updates runner scripts accordingly. The changes are generally well-structured, but there are a few areas for improvement in the new benchmark scripts to enhance robustness and prevent potential race conditions. I've also noted a change in the perf-changelog.yaml that appears to be unrelated to this PR's scope.
benchmarks/dsr1_fp4_b200_trt_mtp_slurm.sh (35)
The script uses a hardcoded filename dsr1-fp4-mtp.yml for the temporary configuration. If multiple instances of this script run concurrently in the same directory, they could overwrite each other's configuration files, leading to race conditions and incorrect benchmark results. It's safer to use mktemp to create a unique temporary file.
Consider also adding a trap so the temporary file is cleaned up on exit (create the file first, then install the trap that references it):
EXTRA_CONFIG_FILE=$(mktemp /tmp/dsr1-fp4-mtp.XXXXXX.yml)
trap 'rm -f "$EXTRA_CONFIG_FILE"' EXIT
benchmarks/dsr1_fp8_b200_trt_mtp_slurm.sh (35)
The script uses a hardcoded filename dsr1-fp8-mtp.yml. This filename is also used in benchmarks/dsr1_fp8_h200_trt_mtp_slurm.sh. If these scripts were ever to run in the same working directory, they would conflict. To prevent race conditions and ensure correctness, a unique temporary file should be created using mktemp.
Consider also adding a trap so the temporary file is cleaned up on exit (create the file first, then install the trap that references it):
EXTRA_CONFIG_FILE=$(mktemp /tmp/dsr1-fp8-mtp.XXXXXX.yml)
trap 'rm -f "$EXTRA_CONFIG_FILE"' EXIT
benchmarks/dsr1_fp8_h200_trt_mtp_slurm.sh (35)
The script uses a hardcoded filename dsr1-fp8-mtp.yml. This filename is also used in benchmarks/dsr1_fp8_b200_trt_mtp_slurm.sh. If these scripts were ever to run in the same working directory, they would conflict. To prevent race conditions and ensure correctness, a unique temporary file should be created using mktemp.
Consider also adding a trap so the temporary file is cleaned up on exit (create the file first, then install the trap that references it):
EXTRA_CONFIG_FILE=$(mktemp /tmp/dsr1-fp8-mtp.XXXXXX.yml)
trap 'rm -f "$EXTRA_CONFIG_FILE"' EXIT
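Putting the mktemp and trap suggestions together, the top of each script might look like the following sketch (the YAML body is an illustrative subset taken from the review's enable_block_reuse note, not the full MTP config):

```shell
# Sketch: create a unique temp config file and clean it up on exit,
# avoiding races between concurrent script instances.
EXTRA_CONFIG_FILE=$(mktemp /tmp/dsr1-fp4-mtp.XXXXXX.yml)
trap 'rm -f "$EXTRA_CONFIG_FILE"' EXIT

# Write the extra LLM API options (illustrative subset only).
cat > "$EXTRA_CONFIG_FILE" <<EOF
kv_cache_config:
  enable_block_reuse: false
EOF
```

Because the trap body is single-quoted, `$EXTRA_CONFIG_FILE` is expanded when the trap fires, not when it is installed, so the cleanup always sees the final value.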
benchmarks/dsr1_fp4_b200_trt_mtp_slurm.sh (46)
There is a trailing space on this line. While it may not affect YAML parsing for a boolean value, it's best to remove it for cleanliness and to avoid potential issues with stricter parsers.
enable_block_reuse: false
benchmarks/dsr1_fp4_b200_trt_mtp_slurm.sh (74-83)
The mpirun command is missing the PYTHONNOUSERSITE=1 prefix, which is present in benchmarks/dsr1_fp8_h200_trt_mtp_slurm.sh. This variable prevents user-site packages from being used, which can help avoid environment-related issues and improve reproducibility. For consistency and robustness, it should be added here as well.
PYTHONNOUSERSITE=1 mpirun -n 1 --oversubscribe --allow-run-as-root \
trtllm-serve $MODEL --port=$PORT \
--trust_remote_code \
--backend=pytorch \
--max_batch_size=$MAX_BATCH_SIZE \
--max_seq_len=$MAX_MODEL_LEN \
--max_num_tokens=$MAX_NUM_TOKENS \
--tp_size=$TP --ep_size=$EP_SIZE \
--extra_llm_api_options=$EXTRA_CONFIG_FILE \
> $SERVER_LOG 2>&1 &
benchmarks/dsr1_fp8_b200_trt_mtp_slurm.sh (46)
There is a trailing space on this line. While it may not affect YAML parsing for a boolean value, it's best to remove it for cleanliness and to avoid potential issues with stricter parsers.
enable_block_reuse: false
benchmarks/dsr1_fp8_b200_trt_mtp_slurm.sh (74-83)
The mpirun command is missing the PYTHONNOUSERSITE=1 prefix, which is present in benchmarks/dsr1_fp8_h200_trt_mtp_slurm.sh. This variable prevents user-site packages from being used, which can help avoid environment-related issues and improve reproducibility. For consistency and robustness, it should be added here as well.
PYTHONNOUSERSITE=1 mpirun -n 1 --oversubscribe --allow-run-as-root \
trtllm-serve $MODEL --port=$PORT \
--trust_remote_code \
--backend=pytorch \
--max_batch_size=$MAX_BATCH_SIZE \
--max_seq_len=$MAX_MODEL_LEN \
--max_num_tokens=$MAX_NUM_TOKENS \
--tp_size=$TP --ep_size=$EP_SIZE \
--extra_llm_api_options=$EXTRA_CONFIG_FILE \
> $SERVER_LOG 2>&1 &
benchmarks/dsr1_fp8_h200_trt_mtp_slurm.sh (51)
There is a trailing space on this line. While it may not affect YAML parsing for a boolean value, it's best to remove it for cleanliness and to avoid potential issues with stricter parsers.
enable_block_reuse: false
perf-changelog.yaml (160-164)
This addition to the changelog appears to be unrelated to the current pull request, which is focused on adding MTP support for DSR1 TRT. The changelog entry refers to updating an SGLang image for MI355x and references PR #395. It's generally better to keep pull requests focused on a single logical change. Please consider moving this to a separate PR.
Note
Introduces MTP-enabled TRT configs and end-to-end launch support for DeepSeek-R1 on single-node NVIDIA systems.
- New configs dsr1-fp4-b200-trt-mtp, dsr1-fp8-b200-trt-mtp, and dsr1-fp8-h200-trt-mtp with spec-decoding: mtp, updated TP/EP/DP-attn search spaces, and notes for MTP sizing
- New scripts benchmarks/dsr1_fp4_b200_trt_mtp_slurm.sh, dsr1_fp8_b200_trt_mtp_slurm.sh, and dsr1_fp8_h200_trt_mtp_slurm.sh configuring the MOE backend, DP attention, and MTP server options
- Runner scripts (runners/launch_*) detect SPEC_DECODING=mtp and select _mtp benchmark scripts via suffix
- benchmarks/benchmark_lib.sh: add an optional --use-chat-template flag and refactor to build the benchmark command array
- perf-changelog.yaml updated to record the new MTP TRT single-node configurations
Written by Cursor Bugbot for commit 5f34e19.