
Conversation

@freeliuzc
Collaborator

  • support multi-step mtp with cudagraph

  • fix usage

  • fix unit test

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

…ddle#5624)

* support multi-step mtp with cudagraph

* fix usage

* fix unit test
Copilot AI review requested due to automatic review settings December 23, 2025 02:59
@paddle-bot

paddle-bot bot commented Dec 23, 2025

Thanks for your contribution!

Contributor

Copilot AI left a comment


Pull request overview

This PR cherry-picks the changes from #5624, adding multi-step support for CUDA Graph capture with the MTP (Multi-Token Prediction) speculative decoding method. It mainly improves the target model's capture logic and makes the number of decode tokens per step configurable.

Main changes:

  • Simplified the CUDA Graph capture logic for the MTP target model, removing the special handling for batch_size == 1
  • Added a dec_token_per_query_per_step parameter to GraphOptimizationConfig so that CUDA Graph capture sizes can adapt to speculative decoding scenarios
  • Updated the batch_size calculation to be derived dynamically from capture_size and num_speculative_tokens (see the sketch below)
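
The relationship between capture sizes, the per-step token count, and the derived batch size can be illustrated with a minimal, self-contained sketch. The function names and base-size pattern below are hypothetical and not part of the FastDeploy API; only the "scale capture sizes by dec_token_per_query_per_step, then divide back to get batch_size" rule mirrors this PR.

# Minimal sketch of the scaling rule described above (names are illustrative).

def scale_capture_sizes(base_sizes, dec_token_per_query_per_step):
    """Scale request-level capture candidates to token-level capture sizes."""
    return [size * dec_token_per_query_per_step for size in base_sizes]

def derive_batch_size(capture_size, num_speculative_tokens):
    """Recover the request-level batch size from a token-level capture size."""
    tokens_per_step = num_speculative_tokens + 1
    assert capture_size % tokens_per_step == 0
    return capture_size // tokens_per_step

# Example: with num_speculative_tokens = 1, each query decodes 2 tokens per step.
sizes = scale_capture_sizes([1, 2, 4, 8], dec_token_per_query_per_step=2)  # [2, 4, 8, 16]
print([derive_batch_size(s, num_speculative_tokens=1) for s in sizes])     # [1, 2, 4, 8]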

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
fastdeploy/worker/gpu_model_runner.py: Refactored the CUDA Graph capture logic for the target model in the MTP scenario; updated how batch_size and expected_decode_len are computed.
fastdeploy/config.py: Added a dec_token_per_query_per_step parameter to the _set_cudagraph_sizes method so that the capture-size calculation accounts for the number of speculative decoding tokens.

Comment on lines 893 to 894
Calculate a series of candidate capture sizes,
and then extract a portion of them as the capture list for the CUDA graph based on user input.

Copilot AI Dec 23, 2025


The new parameter 'dec_token_per_query_per_step' lacks documentation in the docstring. Consider adding a parameter description explaining its purpose, especially how it relates to speculative decoding and its effect on capture size calculations.

Suggested change
Calculate a series of candidate capture sizes,
and then extract a portion of them as the capture list for the CUDA graph based on user input.
Calculate a series of candidate capture sizes and then extract a portion of
them as the capture list for the CUDA graph based on user input.

Args:
    max_capture_size (int): The maximum batch size (in tokens) allowed for
        CUDA graph capture. A candidate capture size equal to this value is
        always included in the list.
    dec_token_per_query_per_step (int): The number of tokens decoded per
        query in each decoding step. When speculative decoding is enabled,
        this typically corresponds to the draft tokens generated per step.
        All candidate capture sizes are scaled by this factor so that the
        CUDA graph captures match the effective per-step token count.

Comment on lines +1944 to +1954
for capture_size in sorted(capture_sizes, reverse=True):
    self._dummy_run(
        num_tokens=(
            self.scheduler_config.max_num_seqs * (self.speculative_config.num_speculative_tokens + 1)
            if self.scheduler_config.splitwise_role == "decode"
            else self.scheduler_config.max_num_batched_tokens
        ),
        batch_size=int(capture_size / (self.speculative_config.num_speculative_tokens + 1)),
        in_capturing=True,
        expected_decode_len=self.speculative_config.num_speculative_tokens,
        accept_all_drafts=True,

Copilot AI Dec 23, 2025


The repeated expression '(self.speculative_config.num_speculative_tokens + 1)' appears multiple times in the code. Consider extracting this into a descriptive variable (e.g., 'tokens_per_step' or 'dec_token_per_query_per_step') at the beginning of the method for better readability and maintainability.
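
For illustration, the extraction the comment describes could look like the following sketch. The variable name tokens_per_step is hypothetical; the surrounding method context is taken from the quoted code above and only the repeated expression is factored out.

tokens_per_step = self.speculative_config.num_speculative_tokens + 1

for capture_size in sorted(capture_sizes, reverse=True):
    self._dummy_run(
        num_tokens=(
            self.scheduler_config.max_num_seqs * tokens_per_step
            if self.scheduler_config.splitwise_role == "decode"
            else self.scheduler_config.max_num_batched_tokens
        ),
        batch_size=int(capture_size / tokens_per_step),
        in_capturing=True,
        expected_decode_len=self.speculative_config.num_speculative_tokens,
        accept_all_drafts=True,
    )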

Comment on lines +1538 to +1541
dec_token_per_query_per_step = (
    self.speculative_config.num_speculative_tokens + 1
    if self.speculative_config is not None and self.speculative_config.method is not None
    else 1

Copilot AI Dec 23, 2025


The condition 'self.speculative_config.method is not None' should be checked before accessing self.speculative_config.num_speculative_tokens to avoid potential AttributeError. The current logic checks method after num_speculative_tokens access. Consider restructuring to: 'self.speculative_config is not None and self.speculative_config.method is not None and self.speculative_config.method in ["mtp"]' for proper validation.
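
A sketch of the restructured guard the comment proposes. The explicit check for the "mtp" method is the reviewer's suggestion, not something already in the diff.

# Hypothetical restructuring: validate the config and method before reading
# num_speculative_tokens, and only scale for the "mtp" method.
if (
    self.speculative_config is not None
    and self.speculative_config.method is not None
    and self.speculative_config.method in ["mtp"]
):
    dec_token_per_query_per_step = self.speculative_config.num_speculative_tokens + 1
else:
    dec_token_per_query_per_step = 1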

Comment on lines +1945 to +1951
self._dummy_run(
    num_tokens=(
        self.scheduler_config.max_num_seqs * (self.speculative_config.num_speculative_tokens + 1)
        if self.scheduler_config.splitwise_role == "decode"
        else self.scheduler_config.max_num_batched_tokens
    ),
    batch_size=int(capture_size / (self.speculative_config.num_speculative_tokens + 1)),

Copilot AI Dec 23, 2025


The division by (num_speculative_tokens + 1) in the batch_size calculation may result in a remainder, leading to loss of precision when converting to int. The code should validate that capture_size is evenly divisible by (num_speculative_tokens + 1), or handle cases where it's not. Consider adding an assertion or filtering logic similar to the even number check used in the draft model capture.

Suggested change
self._dummy_run(
    num_tokens=(
        self.scheduler_config.max_num_seqs * (self.speculative_config.num_speculative_tokens + 1)
        if self.scheduler_config.splitwise_role == "decode"
        else self.scheduler_config.max_num_batched_tokens
    ),
    batch_size=int(capture_size / (self.speculative_config.num_speculative_tokens + 1)),
tokens_group = self.speculative_config.num_speculative_tokens + 1
assert (
    capture_size % tokens_group == 0
), f"capture_size ({capture_size}) must be divisible by num_speculative_tokens + 1 ({tokens_group})"
batch_size = capture_size // tokens_group
self._dummy_run(
    num_tokens=(
        self.scheduler_config.max_num_seqs * tokens_group
        if self.scheduler_config.splitwise_role == "decode"
        else self.scheduler_config.max_num_batched_tokens
    ),
    batch_size=batch_size,

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release/online/20251131@e65000a).

Additional details and impacted files
@@                    Coverage Diff                     @@
##             release/online/20251131    #5695   +/-   ##
==========================================================
  Coverage                           ?   59.11%           
==========================================================
  Files                              ?      319           
  Lines                              ?    39083           
  Branches                           ?     5884           
==========================================================
  Hits                               ?    23102           
  Misses                             ?    14131           
  Partials                           ?     1850           
Flag    Coverage Δ
GPU     59.11% <100.00%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
@qingqing01 qingqing01 merged commit 52280be into PaddlePaddle:release/online/20251131 Dec 23, 2025
18 of 21 checks passed
