[Cherry-Pick][CI] Support multi-step mtp with cudagraph (#5624) #5670
Conversation
…ddle#5624)
* support multi-step mtp with cudagraph
* fix usage
* fix unit test
Thanks for your contribution!
Pull request overview
This PR is a cherry-pick from #5624 that adds support for multi-step MTP (Multi-Token Prediction) with CUDA graph optimization. The changes modify how CUDA graph capture sizes are calculated and how the target model is warmed up during the capture process to support multi-step speculative decoding with MTP.
Key Changes:
- Modified CUDA graph capture size calculation to account for multiple tokens per query per step in MTP scenarios
- Updated target model warm-up logic to use dynamic batch size calculation based on speculative tokens
- Removed the skip condition for batch_size=1 in target model capture
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| fastdeploy/config.py | Added dec_token_per_query_per_step parameter to _set_cudagraph_sizes method to multiply capture sizes by the number of tokens per step, and updated initialization logic to compute this value based on speculative configuration |
| fastdeploy/worker/gpu_model_runner.py | Simplified target model capture logic by removing batch_size=1 skip condition and updating batch size calculation to divide by (num_speculative_tokens + 1) instead of hardcoded division by 2 |
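To make the two changes concrete, here is a minimal standalone sketch of the relationship they rely on; the base sizes and the value of `num_speculative_tokens` are illustrative only, and the snippet is not the patch code itself:

```python
# Sketch of the capture-size scaling this PR describes (example values only).
# With multi-step MTP, each decode query emits (num_speculative_tokens + 1)
# tokens per step, so a graph captured for `capture_size` tokens serves
# capture_size // (num_speculative_tokens + 1) queries.

num_speculative_tokens = 1                 # example: MTP with one draft token
dec_token_per_query_per_step = num_speculative_tokens + 1

base_capture_sizes = [1, 2, 4, 8]          # example base sizes before scaling
scaled = [s * dec_token_per_query_per_step for s in base_capture_sizes]
print(scaled)                              # [2, 4, 8, 16] -> token counts per captured graph

for capture_size in scaled:
    batch_size = capture_size // dec_token_per_query_per_step
    print(f"capture_size={capture_size} -> warm-up batch_size={batch_size}")
```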
```python
self.graph_opt_config._set_cudagraph_sizes(max_capture_size=max_capture_shape)
dec_token_per_query_per_step = (
    self.speculative_config.num_speculative_tokens + 1
    if self.speculative_config is not None and self.speculative_config.method is not None
```
Copilot AI (Dec 22, 2025)
The condition `self.speculative_config.method is not None` is too broad. It causes `dec_token_per_query_per_step` to be calculated for any speculative method, not just "mtp". The check should be `self.speculative_config.method == "mtp"` to align with the MTP-specific requirements in lines 1571-1574 above.
Suggested change:

```diff
- if self.speculative_config is not None and self.speculative_config.method is not None
+ if self.speculative_config is not None and self.speculative_config.method == "mtp"
```
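For readers who want the full guard in one place, the following hypothetical helper mirrors the reviewer's suggestion; the function name, the `SimpleNamespace` configs, the "ngram" method string, and the fallback value of 1 are assumptions for illustration, not quotes from the patch:

```python
from types import SimpleNamespace

def dec_tokens_per_query_per_step(speculative_config) -> int:
    # Hypothetical helper mirroring the suggested narrower check: scale only
    # when the speculative method is "mtp"; otherwise assume one decode token
    # per query per step (assumed fallback, not taken from the diff).
    if speculative_config is not None and speculative_config.method == "mtp":
        return speculative_config.num_speculative_tokens + 1
    return 1

mtp_cfg = SimpleNamespace(method="mtp", num_speculative_tokens=1)
other_cfg = SimpleNamespace(method="ngram", num_speculative_tokens=1)
print(dec_tokens_per_query_per_step(mtp_cfg))    # 2
print(dec_tokens_per_query_per_step(other_cfg))  # 1 -> no scaling for non-MTP methods
```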
```python
self._dummy_run(
    num_tokens=(
        self.scheduler_config.max_num_seqs * (self.speculative_config.num_speculative_tokens + 1)
        if self.scheduler_config.splitwise_role == "decode"
        else self.scheduler_config.max_num_batched_tokens
    ),
    batch_size=int(capture_size / (self.speculative_config.num_speculative_tokens + 1)),
```
Copilot AI (Dec 22, 2025)
The division `int(capture_size / (self.speculative_config.num_speculative_tokens + 1))` can truncate if `capture_size` is not evenly divisible, which could lead to an incorrect batch size being used to warm up the target model. Consider adding a validation check or assertion that `capture_size` is divisible by `(num_speculative_tokens + 1)`, similar to the assertion on line 1575 in config.py that validates that `max_capture_shape` is even.
Suggested change:

```diff
- self._dummy_run(
-     num_tokens=(
-         self.scheduler_config.max_num_seqs * (self.speculative_config.num_speculative_tokens + 1)
-         if self.scheduler_config.splitwise_role == "decode"
-         else self.scheduler_config.max_num_batched_tokens
-     ),
-     batch_size=int(capture_size / (self.speculative_config.num_speculative_tokens + 1)),
+ group_size = self.speculative_config.num_speculative_tokens + 1
+ assert (
+     capture_size % group_size == 0
+ ), f"cudagraph capture_size {capture_size} must be divisible by (num_speculative_tokens + 1) = {group_size}"
+ self._dummy_run(
+     num_tokens=(
+         self.scheduler_config.max_num_seqs * (self.speculative_config.num_speculative_tokens + 1)
+         if self.scheduler_config.splitwise_role == "decode"
+         else self.scheduler_config.max_num_batched_tokens
+     ),
+     batch_size=capture_size // group_size,
```
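A small numeric example (values chosen purely for illustration) shows the truncation the comment warns about and how the suggested assertion guards against it:

```python
# Example values only: two speculative tokens give a group size of 3.
num_speculative_tokens = 2
group_size = num_speculative_tokens + 1

capture_size = 10                          # not a multiple of group_size
print(int(capture_size / group_size))      # 3 -> the graph would cover 9 tokens, not 10

capture_size = 12                          # a multiple of group_size
assert capture_size % group_size == 0, "capture_size must be divisible by group_size"
print(capture_size // group_size)          # 4 -> exact warm-up batch size
```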
```python
Calculate a series of candidate capture sizes,
and then extract a portion of them as the capture list for the CUDA graph based on user input.
```
Copilot AI (Dec 22, 2025)
The documentation for this method should be updated to describe the new `dec_token_per_query_per_step` parameter. The docstring currently doesn't mention this parameter or explain its purpose in the capture size calculations.
Suggested change:

```diff
- Calculate a series of candidate capture sizes,
- and then extract a portion of them as the capture list for the CUDA graph based on user input.
+ Calculate a series of candidate CUDA graph capture sizes and
+ extract a subset of them as the final capture list based on user input.
+
+ The method builds a base sequence of capture sizes (e.g. 1, 2, 4, 8, 16, …)
+ and multiplies every element by ``dec_token_per_query_per_step`` so that all
+ capture sizes correspond to a specific number of decoding tokens per query
+ per step.
+
+ Args:
+     max_capture_size (int): The maximum capture size to consider. This value
+         is always included in the candidate list.
+     dec_token_per_query_per_step (int): Number of decoding tokens per query
+         per step. All candidate capture sizes are scaled by this factor, so
+         each capture size is an integer multiple of this value.
```
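As a reading aid for the suggested docstring, here is a standalone sketch of candidate-size generation that matches what the docstring describes (a doubling base sequence scaled by `dec_token_per_query_per_step`, with `max_capture_size` always included); it is not the actual `_set_cudagraph_sizes` implementation:

```python
def candidate_capture_sizes(max_capture_size: int, dec_token_per_query_per_step: int = 1) -> list[int]:
    """Sketch only: generate doubling capture sizes scaled by the per-step token count."""
    sizes = []
    base = 1
    while base * dec_token_per_query_per_step <= max_capture_size:
        sizes.append(base * dec_token_per_query_per_step)
        base *= 2
    if max_capture_size not in sizes:
        sizes.append(max_capture_size)     # the maximum is always a candidate
    return sizes

print(candidate_capture_sizes(max_capture_size=64, dec_token_per_query_per_step=2))
# [2, 4, 8, 16, 32, 64]
```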
Codecov Report
❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##           release/2.4    #5670   +/- ##
==============================================
  Coverage           ?       58.95%
==============================================
  Files              ?          327
  Lines              ?        40765
  Branches           ?         6200
==============================================
  Hits               ?        24031
  Misses             ?        14860
  Partials           ?         1874
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
support multi-step mtp with cudagraph
fix usage
fix unit test
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list for the PR title: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- When submitting to a `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.