[Cherry-Pick][CI] Support multi-step mtp with cudagraph (#5624) #5695
Conversation
Cherry-picked from PaddlePaddle#5624:
* support multi-step mtp with cudagraph
* fix usage
* fix unit test
Thanks for your contribution!
Pull request overview
This PR cherry-picks the changes from #5624 and adds multi-step support to CUDA Graph capture for the MTP (Multi-Token Prediction) method. It mainly improves the capture logic for the target model and makes the number of decode tokens per step configurable at the config level.
Key changes:
- Simplified the CUDA Graph capture logic for the MTP target model, removing the special handling for batch_size == 1
- Added a dec_token_per_query_per_step parameter to GraphOptimizationConfig so that CUDA Graph capture sizes can adapt to speculative decoding scenarios
- Updated the batch_size calculation to derive it dynamically from capture_size and num_speculative_tokens (see the sketch after this list)
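For illustration, here is a minimal, hypothetical sketch of how candidate capture sizes could be scaled by a per-step token count for speculative decoding; the function and variable names below are illustrative and are not FastDeploy's actual implementation.

```python
def scale_capture_sizes(candidate_sizes, dec_token_per_query_per_step, max_capture_size):
    """Scale per-query capture sizes by the tokens decoded per query per step,
    keeping only the scaled sizes that still fit under the capture limit."""
    scaled = [size * dec_token_per_query_per_step for size in candidate_sizes]
    return [size for size in scaled if size <= max_capture_size]


# Example: with num_speculative_tokens = 1, each query decodes 2 tokens per step,
# so a per-query batch of 4 corresponds to a captured size of 8 tokens.
print(scale_capture_sizes([1, 2, 4, 8], dec_token_per_query_per_step=2, max_capture_size=8))
# -> [2, 4, 8]
```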
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| fastdeploy/worker/gpu_model_runner.py | Refactored the CUDA Graph capture logic for the target model in the MTP scenario and updated how batch_size and expected_decode_len are computed |
| fastdeploy/config.py | Added a dec_token_per_query_per_step parameter to the _set_cudagraph_sizes method so that the capture-size calculation accounts for the number of speculative decoding tokens |
```python
Calculate a series of candidate capture sizes,
and then extract a portion of them as the capture list for the CUDA graph based on user input.
```
Copilot AI · Dec 23, 2025
The new parameter 'dec_token_per_query_per_step' lacks documentation in the docstring. Consider adding a parameter description explaining its purpose, especially how it relates to speculative decoding and its effect on capture size calculations.
Suggested change:
```python
Calculate a series of candidate capture sizes and then extract a portion of
them as the capture list for the CUDA graph based on user input.

Args:
    max_capture_size (int): The maximum batch size (in tokens) allowed for
        CUDA graph capture. A candidate capture size equal to this value is
        always included in the list.
    dec_token_per_query_per_step (int): The number of tokens decoded per
        query in each decoding step. When speculative decoding is enabled,
        this typically corresponds to the draft tokens generated per step.
        All candidate capture sizes are scaled by this factor so that the
        CUDA graph captures match the effective per-step token count.
```
```python
for capture_size in sorted(capture_sizes, reverse=True):
    self._dummy_run(
        num_tokens=(
            self.scheduler_config.max_num_seqs * (self.speculative_config.num_speculative_tokens + 1)
            if self.scheduler_config.splitwise_role == "decode"
            else self.scheduler_config.max_num_batched_tokens
        ),
        batch_size=int(capture_size / (self.speculative_config.num_speculative_tokens + 1)),
        in_capturing=True,
        expected_decode_len=self.speculative_config.num_speculative_tokens,
        accept_all_drafts=True,
```
Copilot AI · Dec 23, 2025
The repeated expression '(self.speculative_config.num_speculative_tokens + 1)' appears multiple times in the code. Consider extracting this into a descriptive variable (e.g., 'tokens_per_step' or 'dec_token_per_query_per_step') at the beginning of the method for better readability and maintainability.
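As a standalone sketch of that hoisting (the function and variable names here are illustrative, not the code that was merged):

```python
# Illustrative only: hoist the repeated "+ 1" expression into a named variable.
def plan_capture_batches(capture_sizes, num_speculative_tokens):
    """Return (capture_size, batch_size) pairs, computing the per-step token count once."""
    tokens_per_step = num_speculative_tokens + 1  # hoisted instead of repeated inline
    return [(size, size // tokens_per_step) for size in sorted(capture_sizes, reverse=True)]


print(plan_capture_batches([2, 4, 8], num_speculative_tokens=1))
# -> [(8, 4), (4, 2), (2, 1)]
```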
```python
dec_token_per_query_per_step = (
    self.speculative_config.num_speculative_tokens + 1
    if self.speculative_config is not None and self.speculative_config.method is not None
    else 1
)
```
Copilot AI · Dec 23, 2025
The condition 'self.speculative_config.method is not None' should be checked before accessing self.speculative_config.num_speculative_tokens to avoid potential AttributeError. The current logic checks method after num_speculative_tokens access. Consider restructuring to: 'self.speculative_config is not None and self.speculative_config.method is not None and self.speculative_config.method in ["mtp"]' for proper validation.
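For reference, a self-contained sketch of the restructured check the comment proposes; SpeculativeConfig here is a hypothetical stand-in type, not FastDeploy's actual class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeculativeConfig:  # hypothetical stand-in for illustration
    method: Optional[str] = None
    num_speculative_tokens: int = 0


def dec_tokens_per_query_per_step(cfg: Optional[SpeculativeConfig]) -> int:
    # Check the config and its method before reading num_speculative_tokens,
    # and only scale when the method is "mtp", as the review comment suggests.
    if cfg is not None and cfg.method is not None and cfg.method in ["mtp"]:
        return cfg.num_speculative_tokens + 1
    return 1


print(dec_tokens_per_query_per_step(SpeculativeConfig(method="mtp", num_speculative_tokens=1)))  # 2
print(dec_tokens_per_query_per_step(None))  # 1
```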
```python
self._dummy_run(
    num_tokens=(
        self.scheduler_config.max_num_seqs * (self.speculative_config.num_speculative_tokens + 1)
        if self.scheduler_config.splitwise_role == "decode"
        else self.scheduler_config.max_num_batched_tokens
    ),
    batch_size=int(capture_size / (self.speculative_config.num_speculative_tokens + 1)),
```
Copilot AI · Dec 23, 2025
The division by (num_speculative_tokens + 1) in the batch_size calculation may result in a remainder, leading to loss of precision when converting to int. The code should validate that capture_size is evenly divisible by (num_speculative_tokens + 1), or handle cases where it's not. Consider adding an assertion or filtering logic similar to the even number check used in the draft model capture.
Suggested change:
```python
tokens_group = self.speculative_config.num_speculative_tokens + 1
assert (
    capture_size % tokens_group == 0
), f"capture_size ({capture_size}) must be divisible by num_speculative_tokens + 1 ({tokens_group})"
batch_size = capture_size // tokens_group
self._dummy_run(
    num_tokens=(
        self.scheduler_config.max_num_seqs * tokens_group
        if self.scheduler_config.splitwise_role == "decode"
        else self.scheduler_config.max_num_batched_tokens
    ),
    batch_size=batch_size,
```
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:
```
@@            Coverage Diff             @@
##   release/online/20251131    #5695   +/-   ##
================================================
  Coverage          ?           59.11%
================================================
  Files             ?              319
  Lines             ?            39083
  Branches          ?             5884
================================================
  Hits              ?            23102
  Misses            ?            14131
  Partials          ?             1850
```
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Merged commit 52280be into PaddlePaddle:release/online/20251131
support multi-step mtp with cudagraph
fix usage
fix unit test
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- PR title tag options: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a PR targeting a release branch, make sure it has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.