[Speculative Decoding]Support different inferseed in speculate decoding #5568
Conversation
Thanks for your contribution!
Force-pushed from f7dab52 to 67defda.
Pull request overview
This pull request adds support for different inference seeds in speculative decoding by modifying the seed increment logic and seed parameter handling in the sampling process.
- Adjusts the infer_seed_increment calculation in gpu_model_runner.py to account for speculative token counts
- Updates padding_sampling_params to generate per-token seeds with proper offsets (see the sketch after this list)
- Changes SpeculativeSampler to use per-token seeds instead of a single global seed
- Modifies MTPSampler to use greedy decoding (argmax) instead of stochastic sampling
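For orientation, here is a minimal sketch of the per-token seed idea. The helper name per_token_seeds and the MAX_INFER_SEED value are assumptions for illustration only, not the actual FastDeploy code:

```python
import paddle

MAX_INFER_SEED = 9223372036854775806  # assumption: a large int64 modulus, not the real constant

def per_token_seeds(request_seed: int, num_tokens: int) -> paddle.Tensor:
    # Each speculative token position gets its own seed: the request's base seed
    # plus a fixed-stride offset (stride 4, as in the PR's loop), wrapped at MAX_INFER_SEED.
    offsets = 4 * paddle.arange(num_tokens, dtype="int64")
    return (request_seed + offsets) % MAX_INFER_SEED

# A request with base seed 123 and 3 draft tokens gets seeds [123, 127, 131],
# so each token's top-p/top-k draw uses a different seed.
print(per_token_seeds(123, 3))
```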
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| fastdeploy/worker/gpu_model_runner.py | Updates infer_seed_increment calculation to scale with speculative token count |
| fastdeploy/model_executor/layers/sample/sampler.py | Adds per-token seed generation logic and updates sampling methods to use token-specific seeds |
```diff
 top_p_padding = paddle.repeat_interleave(top_p[:real_bsz], repeats).unsqueeze(1)
 top_k_padding = paddle.repeat_interleave(top_k[:real_bsz], repeats).unsqueeze(1)
-return top_p_padding, top_k_padding
+topp_seed = paddle.repeat_interleave(infer_seed[:real_bsz], repeats).unsqueeze(1)
```
Copilot AI commented on Dec 15, 2025:
The variable name 'topp_seed' is inconsistent with the existing codebase naming convention. The parameter name in the top_k_top_p_sampling function is 'topp_seed' (lowercase), but this could be confused with 'top_p_seed'. Consider using a clearer name like 'infer_seed_padded' or 'per_token_seed' to better reflect that this tensor contains per-token seed values after padding and offset calculations.
```python
idx = 0
for i in range(real_bsz):
    if seq_lens_encoder[i] == 0:
        seq_len_this_time = seq_lens_this_time[i]

        offsets = 4 * paddle.arange(seq_len_this_time, dtype=topp_seed.dtype)
        topp_seed[idx : idx + seq_len_this_time, 0] = (
            topp_seed[idx : idx + seq_len_this_time, 0] + offsets
        ) % MAX_INFER_SEED

        idx += seq_len_this_time
    else:
        idx += 1
```
Copilot AI commented on Dec 15, 2025:
The Python for loop iterating over real_bsz could be a performance bottleneck for large batch sizes. Consider vectorizing this operation using Paddle tensor operations to avoid the sequential processing, especially since the modulo operation and offset calculation could potentially be batched.
Suggested change (replacing the loop above):

```python
# Vectorized version of the loop for performance
mask = (seq_lens_encoder[:real_bsz] == 0)
seq_lens = seq_lens_this_time
# Compute the start indices for each sequence in the repeated tensors
# For each batch, the start index is the sum of previous sequence lengths
start_indices = paddle.cumsum(
    paddle.concat([paddle.zeros([1], dtype=seq_lens.dtype), seq_lens[:-1]]), axis=0
)
# Only process the indices where mask is True
masked_indices = paddle.nonzero(mask).flatten()
for i in masked_indices.numpy().tolist():
    seq_len = seq_lens[i].item()
    start = start_indices[i].item()
    offsets = 4 * paddle.arange(seq_len, dtype=topp_seed.dtype)
    topp_seed[start : start + seq_len, 0] = (
        topp_seed[start : start + seq_len, 0] + offsets
    ) % MAX_INFER_SEED
```
TODO: will optimize this in the next PR.
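For reference, the offset computation can be vectorized without any Python loop. The sketch below is only an illustration, assuming repeats holds seq_lens_this_time for decode requests and 1 for prefill requests (matching the repeat_interleave padding above); the function and variable names are hypothetical, not FastDeploy's API:

```python
import paddle

def add_per_token_offsets(topp_seed, seq_lens_encoder, repeats, real_bsz, max_infer_seed):
    # Hypothetical vectorized sketch of the loop above, not the actual implementation.
    rep = repeats[:real_bsz].flatten().astype("int64")
    enc = seq_lens_encoder[:real_bsz].flatten()
    # First padded row of each request, and the request index of every padded row.
    group_starts = paddle.cumsum(rep, axis=0) - rep
    row_ids = paddle.repeat_interleave(paddle.arange(real_bsz), rep)
    # Position of each padded row inside its request's group: 0, 1, 2, ...
    pos_in_group = paddle.arange(int(rep.sum())) - paddle.gather(group_starts, row_ids)
    # Offsets apply only to decode requests (seq_lens_encoder == 0); prefill rows stay at 0.
    is_decode = paddle.gather((enc == 0).astype("int64"), row_ids)
    offsets = (4 * pos_in_group * is_decode).unsqueeze(1)
    return (topp_seed + offsets) % max_infer_seed
```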
```diff
     share_inputs["seq_lens_encoder"],
 )
-_, next_tokens = top_k_top_p_sampling(probs, top_p=top_p, top_k=top_k, seed=sampling_metadata.seed[0, 0])
+next_tokens = paddle.argmax(probs, axis=-1)
```
Copilot AI commented on Dec 15, 2025:
Replacing the sampling method with paddle.argmax effectively changes the sampling strategy from stochastic (top-k/top-p sampling) to deterministic (greedy decoding). This is a significant behavioral change for the MTPSampler's forward_cuda method that could break existing functionality. If deterministic sampling is intended for this specific case, it should be documented and validated; otherwise, the top-k/top-p sampling logic should be retained with proper seed handling similar to the SpeculativeSampler implementation.
Suggested change (replacing the argmax line):

```python
_, next_tokens = top_k_top_p_sampling(
    probs, sampling_metadata.top_p, sampling_metadata.top_k, sampling_metadata.top_k_list
)
```
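To make the behavioral difference concrete, here is a tiny standalone example (toy probabilities, not FastDeploy code) contrasting greedy argmax with a seeded stochastic draw:

```python
import paddle

probs = paddle.to_tensor([[0.05, 0.55, 0.40]])  # toy distribution over 3 tokens

# Greedy decoding: always returns index 1, the highest-probability token.
greedy_token = paddle.argmax(probs, axis=-1)

# Stochastic sampling: index 2 is drawn roughly 40% of the time; seeding the
# global generator makes the draw reproducible without changing its distribution.
paddle.seed(42)
sampled_token = paddle.multinomial(probs, num_samples=1)

print(greedy_token, sampled_token)
```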
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             develop    #5568   +/-   ##
==========================================
  Coverage           ?   62.15%
==========================================
  Files              ?      329
  Lines              ?    41562
  Branches           ?     6345
==========================================
  Hits               ?    25834
  Misses             ?    13791
  Partials           ?     1937
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
```diff
     share_inputs["seq_lens_encoder"],
 )
-_, next_tokens = top_k_top_p_sampling(probs, top_p=top_p, top_k=top_k, seed=sampling_metadata.seed[0, 0])
+next_tokens = paddle.argmax(probs, axis=-1)
```
Does this mean we are simply dropping top_p_top_k_sampling for MTP here?
This is fine in all current scenarios; we are landing it this way for now to improve performance, and this part will be refactored in the next few days.
(#5597)
* fix mtp entropy drop in RL
* optimize usage and fix unit test
* optimize padding_sampling_params speed (vectorized)
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Add at least one tag in the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- If submitting to the release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.