[Speculative Decoding] split draft_tokens into standalone post-processing path #5205
Conversation
[Speculative Decoding] split draft_tokens into standalone post-processing path for MTP + logprobs

Thanks for your contribution!
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             develop    #5205   +/-   ##
==========================================
  Coverage           ?   59.07%
==========================================
  Files              ?      317
  Lines              ?    38799
  Branches           ?     5846
==========================================
  Hits               ?    22920
  Misses             ?    14060
  Partials           ?     1819
```

☔ View full report in Codecov by Sentry.
Force-pushed from d16f333 to fe247e5 (Compare)
Pull request overview
This PR refactors the speculative decoding pipeline by extracting draft token processing into a dedicated method _process_batch_draft_tokens. The main goal is to improve code clarity and enable proper logprobs generation for draft tokens in Multi-Token Prediction (MTP) workflows.
- Introduced `_process_batch_draft_tokens` to handle `mtype==4` (draft tokens) separately from `mtype==3` (target tokens); see the sketch after this list
- Updated the OpenAI serving layer to collect and return `draft_logprobs` alongside regular logprobs
- Improved `Request.__repr__` to provide meaningful debug output
- Added comprehensive unit tests for the new draft token processing logic
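As a rough sketch of the dispatch shape this refactor produces (the class scaffold and helper body here are assumptions, not FastDeploy's actual code; only `_process_batch_output`, `_process_batch_draft_tokens`, and the `mtype` values come from the PR):

```python
TARGET_TOKENS = 3  # verified tokens from the main model
DRAFT_TOKENS = 4   # speculative tokens, processed for logprobs

class TokenProcessorSketch:
    def _process_batch_output(self, batch, mtype):
        if mtype == DRAFT_TOKENS:
            # Draft tokens take a dedicated path and return early,
            # keeping the target-token logic below uncluttered.
            return self._process_batch_draft_tokens(batch)
        # ... existing target-token (mtype == 3) processing continues here ...

    def _process_batch_draft_tokens(self, batch):
        # Build request outputs (including draft logprobs) for the draft tokens.
        ...
```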
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `fastdeploy/output/token_processor.py` | Added `_process_batch_draft_tokens` method and refactored `_process_batch_output` to handle draft tokens separately via early return |
| `fastdeploy/entrypoints/openai/serving_chat.py` | Extended chat completion handlers to collect and return `draft_logprobs` in responses |
| `fastdeploy/engine/request.py` | Enhanced `Request.__repr__` to show `request_id` by default and full details in debug mode |
| `tests/output/test_process_batch_draft_tokens.py` | New test file with comprehensive coverage for draft token processing, including edge cases |
| `tests/output/test_process_batch_output_use_zmq.py` | Added copyright header |
Comments suppressed due to low confidence (1)
fastdeploy/output/token_processor.py:943

> This method requires 2 positional arguments, whereas the overriding TokenProcessor.postprocess may be called with 3. This call correctly invokes the base method, but does not match the signature of the overriding method.

```python
def postprocess(self, batch_result):
```
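For illustration, this is the kind of arity mismatch Copilot is flagging (the signatures below are hypothetical; the real FastDeploy methods may differ):

```python
class Base:
    def postprocess(self, batch_result, mtype):
        print(len(batch_result), mtype)

class Sub(Base):
    # The override takes 2 positional arguments (self, batch_result), while
    # callers written against Base may pass 3 (self, batch_result, mtype).
    def postprocess(self, batch_result):
        super().postprocess(batch_result, mtype=3)

Sub().postprocess([])         # fine: matches the override's signature
try:
    Sub().postprocess([], 3)  # matches Base's signature, not the override's
except TypeError as e:
    print(e)  # postprocess() takes 2 positional arguments but 3 were given
```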
```python
        Process batch draft tokens and generate corresponding request outputs

        Args:
            mtype (int): Message type (3=target token, 4=draft token)
```
**Copilot AI** commented on Nov 25, 2025
The docstring states "Message type (3=target token, 4=draft token)" but this could be clearer. Consider expanding it to explain when each type is used and what the difference is between target and draft tokens in the context of speculative decoding. Suggested change:

```diff
-mtype (int): Message type (3=target token, 4=draft token)
+mtype (int): Message type indicating token processing mode
+    - 3: Target tokens (verified tokens from the main model)
+    - 4: Draft tokens (speculative tokens for logprobs collection)
+In speculative decoding, draft tokens are generated for logprobs collection and may be accepted or rejected by the main model, while target tokens are verified outputs from the main model.
```
```python
        metrics=None,
    )

token_ids = tokens[i][:, 0].tolist()[: accept_num[i]]
```
**Copilot AI** commented on Nov 25, 2025
Consider extracting the token slicing logic to make it more readable. This makes it clearer that you're getting the first column (the sampled tokens) and limiting to the accepted count. Suggested change:

```diff
-token_ids = tokens[i][:, 0].tolist()[: accept_num[i]]
+# Extract accepted tokens for this batch item
+max_accepted = accept_num[i]
+token_ids = tokens[i][:max_accepted, 0].tolist()
```
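Both forms should yield the same result on a 2-D token array, since truncating rows before taking column 0 is equivalent to taking column 0 and then truncating the list. A self-contained check, using numpy as a stand-in for the actual tensor type (array contents are made up):

```python
import numpy as np

tokens_i = np.array([[11, 5], [22, 6], [33, 7]])  # column 0 holds the sampled token ids
accept_num_i = 2                                  # number of accepted draft tokens

original = tokens_i[:, 0].tolist()[:accept_num_i]  # take the column, then truncate the list
suggested = tokens_i[:accept_num_i, 0].tolist()    # truncate rows first, then take the column
assert original == suggested == [11, 22]
```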
```python
        task.eos_token_ids = [2]

    def test_process_batch_draft_tokens_normal_case(self):
        """测试正常情况下的target处理"""
```
**Copilot AI** commented on Nov 25, 2025
The comment "测试正常情况下的target处理" (Testing the normal case for target processing) is misleading. This test is for draft token processing (mtype=4), not target token processing (mtype=3). The comment should be updated to reflect that this is testing draft token processing. Suggested change:

```diff
-        """测试正常情况下的target处理"""
+        """测试正常情况下的draft处理"""  # Testing the normal case for draft processing
```
Motivation
This PR refactors the speculative decoding pipeline by extracting `draft_tokens` into a standalone post-processing branch. The goal is to improve the clarity of the MTP (Multi-Token Prediction) workflow and provide a cleaner, more extensible structure for logprobs generation under draft mode.
This also prepares the codebase for future enhancements in speculative decoding.
Modifications
- Split `draft_tokens` handling into a standalone post-processing path.

Usage or Command
Although this PR does not introduce new public APIs, it affects the behavior of draft-token logprobs in speculative decoding.
Below are example requests demonstrating how draft logprobs can be enabled through existing REST interfaces.
/v1/chat/completions Example
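A minimal sketch of such a request (the server address, model name, and message content are placeholders; only the `logprobs`, `top_logprobs`, and `include_draft_logprobs` fields come from this PR's description):

```python
import requests

payload = {
    "model": "default",  # placeholder model name
    "messages": [{"role": "user", "content": "Hello"}],
    "logprobs": True,                # return logprobs for sampled tokens
    "top_logprobs": 5,               # top-k alternatives per position
    "include_draft_logprobs": True,  # also return logprobs for draft tokens
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json())
```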
/v1/completions Example
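A similar sketch for the legacy completions endpoint, under the same assumptions (in OpenAI-style completions APIs, `logprobs` is conventionally an integer top-k rather than a boolean):

```python
import requests

payload = {
    "model": "default",  # placeholder model name
    "prompt": "Once upon a time",
    "max_tokens": 32,
    "logprobs": 5,                   # integer top-k in the completions API
    "include_draft_logprobs": True,  # also return logprobs for draft tokens
}
resp = requests.post("http://localhost:8000/v1/completions", json=payload)
print(resp.json())
```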
These examples demonstrate how to enable:

- `logprobs`
- `top_logprobs`
- `include_draft_logprobs`

Accuracy Tests
This PR does not affect model forward computation or kernel behavior; it only refactors post-processing logic.
The generated outputs for both the target-token and draft-token paths remain consistent with previous behavior.
Unit tests were updated and extended to ensure correctness of:

- draft token processing, including edge cases
- output truncation by the accepted token count (`accept_num`)

Checklist
- The PR title carries the `[Speculative Decoding]` tag.
- `pre-commit` was run before commit.
- Accuracy results: not required if behavior is unchanged, but left unchecked per the template.
- If the PR is submitted to the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.