[Cherry-Pick][CI] Support multi-step mtp with cudagraph (#5624) #5695
Conversation
Cherry-picked from PaddlePaddle#5624:
* support multi-step mtp with cudagraph
* fix usage
* fix unit test
Thanks for your contribution!
Pull request overview
This PR cherry-picks the changes from #5624 and adds multi-step support to CUDA Graph capture for the MTP (Multi-Token Prediction) method. It mainly improves the capture logic for the target model and makes the number of decode tokens per step configurable at the config level.
Key changes:
- Simplified the CUDA Graph capture logic for the MTP target model, removing the special handling for batch_size == 1
- Added a dec_token_per_query_per_step parameter to GraphOptimizationConfig so that CUDA Graph capture sizes can adapt to speculative decoding scenarios
- Updated the batch_size calculation to derive it dynamically from capture_size and num_speculative_tokens (see the sketch after this list)
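For illustration, here is a minimal, hypothetical sketch of how candidate capture sizes could be scaled by a per-step token count for speculative decoding; the function and variable names below are illustrative and are not FastDeploy's actual implementation.

```python
def scale_capture_sizes(candidate_sizes, dec_token_per_query_per_step, max_capture_size):
    """Scale per-query capture sizes by the tokens decoded per query per step,
    keeping only the scaled sizes that still fit under the capture limit."""
    scaled = [size * dec_token_per_query_per_step for size in candidate_sizes]
    return [size for size in scaled if size <= max_capture_size]


# Example: with num_speculative_tokens = 1, each query decodes 2 tokens per step,
# so a per-query batch of 4 corresponds to a captured size of 8 tokens.
print(scale_capture_sizes([1, 2, 4, 8], dec_token_per_query_per_step=2, max_capture_size=8))
# -> [2, 4, 8]
```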
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| fastdeploy/worker/gpu_model_runner.py | Refactored the CUDA Graph capture logic for the target model in the MTP scenario and updated how batch_size and expected_decode_len are computed |
| fastdeploy/config.py | Added a dec_token_per_query_per_step parameter to the _set_cudagraph_sizes method so that the capture-size calculation accounts for the number of speculative decoding tokens |
```python
Calculate a series of candidate capture sizes,
and then extract a portion of them as the capture list for the CUDA graph based on user input.
```
Copilot AI · Dec 23, 2025
The new parameter 'dec_token_per_query_per_step' lacks documentation in the docstring. Consider adding a parameter description explaining its purpose, especially how it relates to speculative decoding and its effect on capture size calculations.
Suggested change:
```python
Calculate a series of candidate capture sizes and then extract a portion of
them as the capture list for the CUDA graph based on user input.

Args:
    max_capture_size (int): The maximum batch size (in tokens) allowed for
        CUDA graph capture. A candidate capture size equal to this value is
        always included in the list.
    dec_token_per_query_per_step (int): The number of tokens decoded per
        query in each decoding step. When speculative decoding is enabled,
        this typically corresponds to the draft tokens generated per step.
        All candidate capture sizes are scaled by this factor so that the
        CUDA graph captures match the effective per-step token count.
```
```python
for capture_size in sorted(capture_sizes, reverse=True):
    self._dummy_run(
        num_tokens=(
            self.scheduler_config.max_num_seqs * (self.speculative_config.num_speculative_tokens + 1)
            if self.scheduler_config.splitwise_role == "decode"
            else self.scheduler_config.max_num_batched_tokens
        ),
        batch_size=int(capture_size / (self.speculative_config.num_speculative_tokens + 1)),
        in_capturing=True,
        expected_decode_len=self.speculative_config.num_speculative_tokens,
        accept_all_drafts=True,
```
Copilot AI · Dec 23, 2025
The repeated expression '(self.speculative_config.num_speculative_tokens + 1)' appears multiple times in the code. Consider extracting this into a descriptive variable (e.g., 'tokens_per_step' or 'dec_token_per_query_per_step') at the beginning of the method for better readability and maintainability.
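As a standalone sketch of that hoisting (the function and variable names here are illustrative, not the code that was merged):

```python
# Illustrative only: hoist the repeated "+ 1" expression into a named variable.
def plan_capture_batches(capture_sizes, num_speculative_tokens):
    """Return (capture_size, batch_size) pairs, computing the per-step token count once."""
    tokens_per_step = num_speculative_tokens + 1  # hoisted instead of repeated inline
    return [(size, size // tokens_per_step) for size in sorted(capture_sizes, reverse=True)]


print(plan_capture_batches([2, 4, 8], num_speculative_tokens=1))
# -> [(8, 4), (4, 2), (2, 1)]
```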
```python
dec_token_per_query_per_step = (
    self.speculative_config.num_speculative_tokens + 1
    if self.speculative_config is not None and self.speculative_config.method is not None
    else 1
)
```
Copilot AI · Dec 23, 2025
The condition 'self.speculative_config.method is not None' should be checked before accessing self.speculative_config.num_speculative_tokens to avoid potential AttributeError. The current logic checks method after num_speculative_tokens access. Consider restructuring to: 'self.speculative_config is not None and self.speculative_config.method is not None and self.speculative_config.method in ["mtp"]' for proper validation.
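For reference, a self-contained sketch of the restructured check the comment proposes; SpeculativeConfig here is a hypothetical stand-in type, not FastDeploy's actual class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeculativeConfig:  # hypothetical stand-in for illustration
    method: Optional[str] = None
    num_speculative_tokens: int = 0


def dec_tokens_per_query_per_step(cfg: Optional[SpeculativeConfig]) -> int:
    # Check the config and its method before reading num_speculative_tokens,
    # and only scale when the method is "mtp", as the review comment suggests.
    if cfg is not None and cfg.method is not None and cfg.method in ["mtp"]:
        return cfg.num_speculative_tokens + 1
    return 1


print(dec_tokens_per_query_per_step(SpeculativeConfig(method="mtp", num_speculative_tokens=1)))  # 2
print(dec_tokens_per_query_per_step(None))  # 1
```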
```python
self._dummy_run(
    num_tokens=(
        self.scheduler_config.max_num_seqs * (self.speculative_config.num_speculative_tokens + 1)
        if self.scheduler_config.splitwise_role == "decode"
        else self.scheduler_config.max_num_batched_tokens
    ),
    batch_size=int(capture_size / (self.speculative_config.num_speculative_tokens + 1)),
```
Copilot AI · Dec 23, 2025
The division by (num_speculative_tokens + 1) in the batch_size calculation may result in a remainder, leading to loss of precision when converting to int. The code should validate that capture_size is evenly divisible by (num_speculative_tokens + 1), or handle cases where it's not. Consider adding an assertion or filtering logic similar to the even number check used in the draft model capture.
Suggested change:
```python
tokens_group = self.speculative_config.num_speculative_tokens + 1
assert (
    capture_size % tokens_group == 0
), f"capture_size ({capture_size}) must be divisible by num_speculative_tokens + 1 ({tokens_group})"
batch_size = capture_size // tokens_group
self._dummy_run(
    num_tokens=(
        self.scheduler_config.max_num_seqs * tokens_group
        if self.scheduler_config.splitwise_role == "decode"
        else self.scheduler_config.max_num_batched_tokens
    ),
    batch_size=batch_size,
```
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:
```
@@            Coverage Diff             @@
##   release/online/20251131    #5695   +/-   ##
================================================
  Coverage          ?           59.11%
================================================
  Files             ?              319
  Lines             ?            39083
  Branches          ?             5884
================================================
  Hits              ?            23102
  Misses            ?            14131
  Partials          ?             1850
```
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Merged commit 52280be into PaddlePaddle:release/online/20251131
support multi-step mtp with cudagraph
fix usage
fix unit test
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- PR title tag options: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a PR targeting a release branch, make sure it has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.