[Speculative Decoding] Support multi-step mtp with cudagraph #5624
Conversation
Thanks for your contribution!
Pull request overview
This PR adds support for multi-step MTP (Multi-Token Prediction) with CUDAGraph. The changes enable proper capture of CUDA graphs for MTP scenarios by adjusting capture sizes to account for multiple tokens generated per query per step.
Key Changes
- Modified CUDA graph capture logic for MTP target model to use dynamic batch size calculation based on speculative token count
- Updated `_set_cudagraph_sizes` to generate capture sizes scaled by tokens per query per step
- Simplified target model capture by removing special handling for batch size 1
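The capture-size scaling described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation; the helper name is hypothetical, while `dec_token_per_query_per_step` is the parameter the PR adds to `fastdeploy/config.py`:

```python
def scale_capture_sizes(base_sizes, dec_token_per_query_per_step):
    """Scale CUDA graph capture sizes so each captured graph covers all
    tokens a batch of queries emits in one multi-step MTP step."""
    return [size * dec_token_per_query_per_step for size in base_sizes]

# e.g. base batch sizes [1, 2, 4, 8] with 2 tokens per query per step
print(scale_capture_sizes([1, 2, 4, 8], 2))  # [2, 4, 8, 16]
```

Scaling at capture time keeps the graph's padded token count aligned with what the target model actually receives during multi-token verification.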
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `fastdeploy/worker/gpu_model_runner.py` | Simplified MTP target model capture logic, removed batch size 1 skip condition, updated batch size and `expected_decode_len` calculations |
| `fastdeploy/config.py` | Added `dec_token_per_query_per_step` parameter to scale CUDA graph capture sizes appropriately for multi-step MTP |
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
## develop    #5624   +/- ##
==========================================
  Coverage     ?    62.88%
==========================================
  Files        ?       329
  Lines        ?     41700
  Branches     ?      6368
==========================================
  Hits         ?     26223
  Misses       ?     13492
  Partials     ?      1985
```

Flags with carried forward coverage won't be shown.
gongshaotian left a comment:
LGTM
…ddle#5624) * support multi-step mtp with cudagraph * fix usage * fix unit test
```python
if batch_size == 1:
    logger.info("Skip token_num = 1, when capture Draft model for mtp")
else:
    assert batch_size % 2 == 0
```
Remove the `assert`.
```python
    if self.scheduler_config.splitwise_role == "decode"
    else self.scheduler_config.max_num_batched_tokens
),
batch_size=int(batch_size / 2),
```
```python
batch_size=int(capture_size / (self.speculative_config.num_speculative_tokens + 1)),
```
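The suggested change derives the batch size from the captured token count rather than halving it. A minimal sketch of that relationship, assuming each query emits one verified token plus `num_speculative_tokens` draft tokens per step (the helper name is hypothetical):

```python
def capture_batch_size(capture_size, num_speculative_tokens):
    """A CUDA graph captured for capture_size tokens serves
    capture_size // (num_speculative_tokens + 1) queries, since each
    query consumes one verified token plus its draft tokens per step."""
    tokens_per_query = num_speculative_tokens + 1
    # capture sizes are generated as multiples of tokens_per_query
    assert capture_size % tokens_per_query == 0
    return capture_size // tokens_per_query

print(capture_batch_size(64, 1))  # 32
```

With one speculative token this reproduces the old `batch_size / 2`, but it stays correct when the step count or draft token count changes.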
```python
),
batch_size=int(batch_size / 2),
in_capturing=True,
expected_decode_len=3,
```
This, together with the exit logic in `_dummy_run()`, needs to be updated.
1 + draft token + draft model EOS token
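The breakdown above explains why the hard-coded `expected_decode_len=3` in the diff works only for a single draft token: one target-model token, the draft tokens, and one slot for the draft model's EOS token. A hedged sketch of the general formula (helper name hypothetical):

```python
def expected_decode_len(num_draft_tokens):
    # 1 target-model token + the draft tokens + 1 slot reserved for the
    # draft model's EOS token, per the review comment above
    return 1 + num_draft_tokens + 1

print(expected_decode_len(1))  # 3, matching the value hard-coded in the diff
```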
```python
logger.info(
    f"Warm up the Target model with the num_tokens:{capture_size}, expected_decode_len:{self.speculative_config.num_speculative_tokens}"
)
if self.graph_opt_config.draft_model_use_cudagraph:
```
Enable this launch flag.
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Add at least one tag to the PR title from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For a `release` branch PR, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the [Cherry-Pick] PR tag.