Conversation

@sunlei1024
Collaborator

Motivation

This PR refactors the speculative decoding pipeline by extracting draft_tokens into a standalone post-processing branch.
The goal is to improve the clarity of the MTP (Multi-Token Prediction) workflow and provide a cleaner and more extensible structure for logprobs generation under draft mode.
This also prepares the codebase for future enhancements in speculative decoding.

Modifications

  • Added a dedicated post-processing path for draft_tokens.
  • Refactored the MTP handling logic to separate the draft-token flow from the standard completion flow (see the sketch after this list).
  • Updated logprobs computation to properly support draft token outputs.
  • Improved readability and maintainability of the speculative decoding code.
  • Adjusted related data structures and RequestOutput construction to align with the new branch.
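The sketch below illustrates the intended control flow: draft tokens (mtype == 4) are diverted to a standalone branch via an early return, while target tokens (mtype == 3) continue through the standard completion flow. The method names _process_batch_output and _process_batch_draft_tokens and the mtype values come from this PR; the TokenBatch container and the output fields are simplified illustrations, not FastDeploy's actual data structures.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TokenBatch:
    # Illustrative stand-in for the real batch payload (not FastDeploy's API).
    mtype: int                 # 3 = target tokens, 4 = draft tokens
    tokens: List[List[int]]    # per-request token ids
    accept_num: List[int]      # accepted tokens per request (speculative decoding)
    logprobs: Optional[list] = None

class TokenProcessorSketch:
    def _process_batch_output(self, batch: TokenBatch):
        # Draft tokens take a standalone post-processing path and return early,
        # so the standard (target-token) flow below stays untouched.
        if batch.mtype == 4:
            return self._process_batch_draft_tokens(batch)
        # mtype == 3: verified target tokens follow the normal completion flow
        # (stop-flag handling, incremental detokenization, RequestOutput assembly).
        return [
            {"token_ids": ids[: batch.accept_num[i]], "draft": False}
            for i, ids in enumerate(batch.tokens)
        ]

    def _process_batch_draft_tokens(self, batch: TokenBatch):
        # Draft tokens are emitted only so draft logprobs can be collected;
        # they may still be rejected by the target model, so no stop/finish
        # handling is performed in this branch.
        return [
            {"token_ids": ids[: batch.accept_num[i]], "draft": True}
            for i, ids in enumerate(batch.tokens)
        ]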

Usage or Command

Although this PR does not introduce new public APIs, it affects the behavior of draft-token logprobs in speculative decoding.
Below are example requests demonstrating how draft logprobs can be enabled through the existing REST interfaces; a Python client sketch follows these examples.

/v1/chat/completions Example

curl --location 'http://127.0.0.1:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "messages": [
        {
            "role": "user",
            "content": "你是谁?"
        }
    ],
    "logprobs": true,
    "top_logprobs": 5,
    "include_draft_logprobs": true,
    "stream": true
}'

/v1/completions Example

curl --location 'http://127.0.0.1:8000/v1/completions' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "你好",
    "stream": false,
    "logprobs": 5,
    "include_draft_logprobs": true
}'

These examples demonstrate how to enable:

  • logprobs
  • top_logprobs
  • include_draft_logprobs
  • streaming / non-streaming outputs
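For completeness, the same non-streaming /v1/completions request can be issued from Python. The include_draft_logprobs request field matches the curl examples above; the exact location and shape of draft_logprobs in the response body are assumptions here, since this PR only states that draft logprobs are returned alongside the regular logprobs.

import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/completions",
    headers={"Content-Type": "application/json"},
    json={
        "prompt": "Hello",
        "stream": False,
        "logprobs": 5,
        "include_draft_logprobs": True,
    },
    timeout=60,
)
choice = resp.json()["choices"][0]
print(choice.get("logprobs"))        # regular top-K logprobs
print(choice.get("draft_logprobs"))  # draft-token logprobs (field location assumed)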

Accuracy Tests

This PR does not affect model forward computation or kernel behavior; it only refactors post-processing logic.
The generated outputs remain consistent with previous behavior in both:

  • standard MTP mode, and
  • draft-token mode with logprobs

Unit tests were updated and extended to ensure correctness of the following (a minimal test sketch appears after this list):

  • draft token post-processing
  • stop-flag handling
  • variable acceptance count (accept_num)
  • different K top-logprobs values
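As a rough illustration of the accept_num handling those tests exercise, the self-contained sketch below checks the slicing behavior in isolation with unittest; the helper function is illustrative and is not the real TokenProcessor code.

import unittest
import numpy as np

def accepted_token_ids(tokens: np.ndarray, accept_num: int) -> list:
    # Column 0 holds the sampled token ids; keep only the accepted prefix.
    return tokens[:accept_num, 0].tolist()

class TestDraftTokenSlicing(unittest.TestCase):
    def test_variable_accept_num(self):
        tokens = np.array([[11, 1], [22, 2], [33, 3]])  # col 0 = token id
        self.assertEqual(accepted_token_ids(tokens, 2), [11, 22])
        self.assertEqual(accepted_token_ids(tokens, 3), [11, 22, 33])

    def test_zero_accepted_tokens(self):
        tokens = np.array([[11, 1]])
        self.assertEqual(accepted_token_ids(tokens, 0), [])

if __name__ == "__main__":
    unittest.main()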

Checklist

  • Add at least one tag in the PR title.
    • Recommended tag: [Speculative Decoding]
  • Format your code and run pre-commit before committing.
  • Add unit tests.
  • Provide accuracy results.
    (Not required here since behavior is unchanged; left unchecked per the template.)
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Nov 25, 2025

Thanks for your contribution!

paddle-bot bot added the contributor (External developers) label Nov 25, 2025
@codecov-commenter

codecov-commenter commented Nov 25, 2025

Codecov Report

❌ Patch coverage is 97.22222% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@cead6b2).

Files with missing lines | Patch % | Lines
fastdeploy/output/token_processor.py | 96.87% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #5205   +/-   ##
==========================================
  Coverage           ?   59.07%           
==========================================
  Files              ?      317           
  Lines              ?    38799           
  Branches           ?     5846           
==========================================
  Hits               ?    22920           
  Misses             ?    14060           
  Partials           ?     1819           
Flag | Coverage Δ
GPU | 59.07% <97.22%> (?)

Flags with carried forward coverage won't be shown.


sunlei1024 changed the title [Speculative Decoding] split draft_tokens into standalone post-processing pat… [Speculative Decoding] split draft_tokens into standalone post-processing path Nov 25, 2025
Contributor

Copilot AI left a comment


Pull request overview

This PR refactors the speculative decoding pipeline by extracting draft token processing into a dedicated method _process_batch_draft_tokens. The main goal is to improve code clarity and enable proper logprobs generation for draft tokens in Multi-Token Prediction (MTP) workflows.

  • Introduced _process_batch_draft_tokens to handle mtype==4 (draft tokens) separately from mtype==3 (target tokens)
  • Updated the OpenAI serving layer to collect and return draft_logprobs alongside regular logprobs
  • Improved Request.__repr__ to provide meaningful debug output (a brief sketch follows this list)
  • Added comprehensive unit tests for the new draft token processing logic
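A minimal sketch of the described __repr__ behaviour is shown below: terse by default, verbose in debug mode. The FD_DEBUG environment variable and the field names are assumptions for illustration; only the behaviour (request_id by default, full details in debug mode) is taken from this PR.

import os
from dataclasses import dataclass, field

@dataclass
class RequestSketch:
    # Illustrative stand-in for fastdeploy.engine.request.Request.
    request_id: str
    prompt: str = ""
    sampling_params: dict = field(default_factory=dict)

    def __repr__(self) -> str:
        # Full details only when a debug switch is set (FD_DEBUG is an assumed name).
        if os.getenv("FD_DEBUG", "0") == "1":
            return (f"Request(request_id={self.request_id!r}, prompt={self.prompt!r}, "
                    f"sampling_params={self.sampling_params!r})")
        return f"Request(request_id={self.request_id!r})"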

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Summary per file:

  • fastdeploy/output/token_processor.py: Added _process_batch_draft_tokens method and refactored _process_batch_output to handle draft tokens separately via early return
  • fastdeploy/entrypoints/openai/serving_chat.py: Extended chat completion handlers to collect and return draft_logprobs in responses
  • fastdeploy/engine/request.py: Enhanced Request.__repr__ to show request_id by default and full details in debug mode
  • tests/output/test_process_batch_draft_tokens.py: New test file with comprehensive coverage for draft token processing including edge cases
  • tests/output/test_process_batch_output_use_zmq.py: Added copyright header
Comments suppressed due to low confidence (1)

fastdeploy/output/token_processor.py:943

  • This method requires 2 positional arguments, whereas overridden TokenProcessor.postprocess may be called with 3. This call correctly calls the base method, but does not match the signature of the overriding method.
    def postprocess(self, batch_result):


Process batch draft tokens and generate corresponding request outputs

Args:
mtype (int): Message type (3=target token, 4=draft token)

Copilot AI Nov 25, 2025


The docstring states "Message type (3=target token, 4=draft token)" but this could be clearer. Consider expanding this to explain when each type is used and what the difference is between target and draft tokens in the context of speculative decoding. For example:

mtype (int): Message type indicating token processing mode
    - 3: Target tokens (verified tokens from the main model)
    - 4: Draft tokens (speculative tokens for logprobs collection)
Suggested change
mtype (int): Message type (3=target token, 4=draft token)
mtype (int): Message type indicating token processing mode
- 3: Target tokens (verified tokens from the main model)
- 4: Draft tokens (speculative tokens for logprobs collection)
In speculative decoding, draft tokens are generated for logprobs collection and may be accepted or rejected by the main model, while target tokens are verified outputs from the main model.

metrics=None,
)

token_ids = tokens[i][:, 0].tolist()[: accept_num[i]]

Copilot AI Nov 25, 2025


Consider extracting the token slicing logic to make it more readable:

# Extract accepted tokens for this batch item
max_accepted = accept_num[i]
token_ids = tokens[i][:max_accepted, 0].tolist()

This makes it clearer that you're getting the first column (the sampled tokens) and limiting to the accepted count.

Suggested change
token_ids = tokens[i][:, 0].tolist()[: accept_num[i]]
# Extract accepted tokens for this batch item
max_accepted = accept_num[i]
token_ids = tokens[i][:max_accepted, 0].tolist()

task.eos_token_ids = [2]

def test_process_batch_draft_tokens_normal_case(self):
"""测试正常情况下的target处理"""

Copilot AI Nov 25, 2025


The comment "测试正常情况下的target处理" (Testing normal case for target processing) is misleading. This test is for draft token processing (mtype=4), not target token processing (mtype=3). The comment should be updated to reflect that this is testing draft token processing:

"""测试正常情况下的draft处理"""  # Testing normal case for draft processing
Suggested change
"""测试正常情况下的target处理"""
"""测试正常情况下的draft处理"""

freeliuzc previously approved these changes Nov 26, 2025
Jiang-Jia-Jun merged commit c424e08 into PaddlePaddle:develop Nov 27, 2025
16 of 17 checks passed
Jiang-Jia-Jun pushed a commit that referenced this pull request Nov 27, 2025
…path(#5205) (#5231)

* refactor(mtp): split draft_tokens into standalone post-processing path for MTP + logprobs

* Restore Request.__repr__ implementation

* ci

* add envs

* fix unittest
Jiang-Jia-Jun pushed a commit that referenced this pull request Nov 27, 2025
…path(#5205)  (#5232)

* merge code

* fix Request CONFLICT

* remove unuse unittest

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
sunlei1024 deleted the refac/mtp-logprobs branch December 3, 2025 05:23