[Speculative Decoding] split draft_tokens into standalone post-processing path #5205
Conversation
[Speculative Decoding] split draft_tokens into standalone post-processing path for MTP + logprobs

Thanks for your contribution!
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             develop    #5205   +/-   ##
==========================================
  Coverage           ?   59.07%
==========================================
  Files              ?      317
  Lines              ?    38799
  Branches           ?     5846
==========================================
  Hits               ?    22920
  Misses             ?    14060
  Partials           ?     1819
```

☔ View full report in Codecov by Sentry.
Force-pushed from d16f333 to fe247e5 (Compare)
Pull request overview
This PR refactors the speculative decoding pipeline by extracting draft token processing into a dedicated method _process_batch_draft_tokens. The main goal is to improve code clarity and enable proper logprobs generation for draft tokens in Multi-Token Prediction (MTP) workflows.
- Introduced `_process_batch_draft_tokens` to handle `mtype==4` (draft tokens) separately from `mtype==3` (target tokens); see the sketch after this list
- Updated the OpenAI serving layer to collect and return `draft_logprobs` alongside regular logprobs
- Improved `Request.__repr__` to provide meaningful debug output
- Added comprehensive unit tests for the new draft token processing logic
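As a rough sketch of the dispatch shape this refactor produces (the class scaffold and helper body here are assumptions, not FastDeploy's actual code; only `_process_batch_output`, `_process_batch_draft_tokens`, and the `mtype` values come from the PR):

```python
TARGET_TOKENS = 3  # verified tokens from the main model
DRAFT_TOKENS = 4   # speculative tokens, processed for logprobs

class TokenProcessorSketch:
    def _process_batch_output(self, batch, mtype):
        if mtype == DRAFT_TOKENS:
            # Draft tokens take a dedicated path and return early,
            # keeping the target-token logic below uncluttered.
            return self._process_batch_draft_tokens(batch)
        # ... existing target-token (mtype == 3) processing continues here ...

    def _process_batch_draft_tokens(self, batch):
        # Build request outputs (including draft logprobs) for the draft tokens.
        ...
```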
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `fastdeploy/output/token_processor.py` | Added `_process_batch_draft_tokens` method and refactored `_process_batch_output` to handle draft tokens separately via early return |
| `fastdeploy/entrypoints/openai/serving_chat.py` | Extended chat completion handlers to collect and return `draft_logprobs` in responses |
| `fastdeploy/engine/request.py` | Enhanced `Request.__repr__` to show `request_id` by default and full details in debug mode |
| `tests/output/test_process_batch_draft_tokens.py` | New test file with comprehensive coverage for draft token processing, including edge cases |
| `tests/output/test_process_batch_output_use_zmq.py` | Added copyright header |
Comments suppressed due to low confidence (1)
fastdeploy/output/token_processor.py:943

> This method requires 2 positional arguments, whereas the overriding TokenProcessor.postprocess may be called with 3. This call correctly invokes the base method, but does not match the signature of the overriding method.

```python
def postprocess(self, batch_result):
```
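For illustration, this is the kind of arity mismatch Copilot is flagging (the signatures below are hypothetical; the real FastDeploy methods may differ):

```python
class Base:
    def postprocess(self, batch_result, mtype):
        print(len(batch_result), mtype)

class Sub(Base):
    # The override takes 2 positional arguments (self, batch_result), while
    # callers written against Base may pass 3 (self, batch_result, mtype).
    def postprocess(self, batch_result):
        super().postprocess(batch_result, mtype=3)

Sub().postprocess([])         # fine: matches the override's signature
try:
    Sub().postprocess([], 3)  # matches Base's signature, not the override's
except TypeError as e:
    print(e)  # postprocess() takes 2 positional arguments but 3 were given
```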
```python
        Process batch draft tokens and generate corresponding request outputs

        Args:
            mtype (int): Message type (3=target token, 4=draft token)
```
**Copilot AI** commented on Nov 25, 2025
The docstring states "Message type (3=target token, 4=draft token)" but this could be clearer. Consider expanding it to explain when each type is used and what the difference is between target and draft tokens in the context of speculative decoding. Suggested change:

```diff
-mtype (int): Message type (3=target token, 4=draft token)
+mtype (int): Message type indicating token processing mode
+    - 3: Target tokens (verified tokens from the main model)
+    - 4: Draft tokens (speculative tokens for logprobs collection)
+In speculative decoding, draft tokens are generated for logprobs collection and may be accepted or rejected by the main model, while target tokens are verified outputs from the main model.
```
```python
        metrics=None,
    )

token_ids = tokens[i][:, 0].tolist()[: accept_num[i]]
```
**Copilot AI** commented on Nov 25, 2025
Consider extracting the token slicing logic to make it more readable. This makes it clearer that you're getting the first column (the sampled tokens) and limiting to the accepted count. Suggested change:

```diff
-token_ids = tokens[i][:, 0].tolist()[: accept_num[i]]
+# Extract accepted tokens for this batch item
+max_accepted = accept_num[i]
+token_ids = tokens[i][:max_accepted, 0].tolist()
```
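Both forms should yield the same result on a 2-D token array, since truncating rows before taking column 0 is equivalent to taking column 0 and then truncating the list. A self-contained check, using numpy as a stand-in for the actual tensor type (array contents are made up):

```python
import numpy as np

tokens_i = np.array([[11, 5], [22, 6], [33, 7]])  # column 0 holds the sampled token ids
accept_num_i = 2                                  # number of accepted draft tokens

original = tokens_i[:, 0].tolist()[:accept_num_i]  # take the column, then truncate the list
suggested = tokens_i[:accept_num_i, 0].tolist()    # truncate rows first, then take the column
assert original == suggested == [11, 22]
```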
```python
        task.eos_token_ids = [2]

    def test_process_batch_draft_tokens_normal_case(self):
        """测试正常情况下的target处理"""
```
**Copilot AI** commented on Nov 25, 2025
The comment "测试正常情况下的target处理" (Testing the normal case for target processing) is misleading. This test is for draft token processing (mtype=4), not target token processing (mtype=3). The comment should be updated to reflect that this is testing draft token processing. Suggested change:

```diff
-        """测试正常情况下的target处理"""
+        """测试正常情况下的draft处理"""  # Testing the normal case for draft processing
```
Motivation
This PR refactors the speculative decoding pipeline by extracting `draft_tokens` into a standalone post-processing branch. The goal is to improve the clarity of the MTP (Multi-Token Prediction) workflow and provide a cleaner, more extensible structure for logprobs generation under draft mode.
This also prepares the codebase for future enhancements in speculative decoding.
Modifications
- Split `draft_tokens` handling into a standalone post-processing path.

Usage or Command
Although this PR does not introduce new public APIs, it affects the behavior of draft-token logprobs in speculative decoding.
Below are example requests demonstrating how draft logprobs can be enabled through existing REST interfaces.
/v1/chat/completions Example
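A minimal sketch of such a request (the server address, model name, and message content are placeholders; only the `logprobs`, `top_logprobs`, and `include_draft_logprobs` fields come from this PR's description):

```python
import requests

payload = {
    "model": "default",  # placeholder model name
    "messages": [{"role": "user", "content": "Hello"}],
    "logprobs": True,                # return logprobs for sampled tokens
    "top_logprobs": 5,               # top-k alternatives per position
    "include_draft_logprobs": True,  # also return logprobs for draft tokens
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json())
```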
/v1/completions Example
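A similar sketch for the legacy completions endpoint, under the same assumptions (in OpenAI-style completions APIs, `logprobs` is conventionally an integer top-k rather than a boolean):

```python
import requests

payload = {
    "model": "default",  # placeholder model name
    "prompt": "Once upon a time",
    "max_tokens": 32,
    "logprobs": 5,                   # integer top-k in the completions API
    "include_draft_logprobs": True,  # also return logprobs for draft tokens
}
resp = requests.post("http://localhost:8000/v1/completions", json=payload)
print(resp.json())
```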
These examples demonstrate how to enable:

- `logprobs`
- `top_logprobs`
- `include_draft_logprobs`

Accuracy Tests
This PR does not affect model forward computation or kernel behavior; it only refactors post-processing logic.
The generated outputs for both the target-token and draft-token paths remain consistent with previous behavior.
Unit tests were updated and extended to ensure correctness of:

- draft token processing, including edge cases
- output truncation by the accepted token count (`accept_num`)

Checklist
- The PR title carries the `[Speculative Decoding]` tag.
- `pre-commit` was run before commit.
- Accuracy results: not required if behavior is unchanged, but left unchecked per the template.
- If the PR is submitted to the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.