Conversation

freeliuzc (Collaborator) commented Dec 15, 2025

Motivation

  1. In the current FD design, the whole batch shares a single seed (an int) as the random source for sampling, incremented once per step. Without speculative decoding, the tokens produced at the query level therefore come from different seeds.
  2. With speculative decoding, each step may produce two draft tokens, so every two tokens effectively share one seed, which reduces diversity by roughly 10%.

Modifications

  1. Change seed from an int to an int* tensor so that every token in the batch gets a distinct seed (see the sketch below).
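
For context, a minimal sketch of the per-token seed layout this change introduces (assuming 1-D seq_lens_* tensors and the names that appear in the diff further down; the wrapper function itself is hypothetical, not the actual FastDeploy code):

import paddle

MAX_INFER_SEED = 9223372036854775807  # placeholder; the real constant is defined in FastDeploy

def pad_seeds(infer_seed, seq_lens_this_time, seq_lens_encoder, real_bsz, repeats):
    """Expand the per-request seed into one seed per padded token row, then offset
    each decoder token so that no two tokens in the batch share the same seed.
    `repeats` is assumed to hold the number of padded rows per request."""
    topp_seed = paddle.repeat_interleave(infer_seed[:real_bsz], repeats).unsqueeze(1)
    idx = 0
    for i in range(real_bsz):
        if seq_lens_encoder[i] == 0:  # decoder request: one seed per generated token
            seq_len_this_time = int(seq_lens_this_time[i])  # cast for clarity in this sketch
            offsets = 4 * paddle.arange(seq_len_this_time, dtype=topp_seed.dtype)
            topp_seed[idx : idx + seq_len_this_time, 0] = (
                topp_seed[idx : idx + seq_len_this_time, 0] + offsets
            ) % MAX_INFER_SEED
            idx += seq_len_this_time
        else:  # prefill request occupies a single row; keep its base seed
            idx += 1
    return topp_seed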

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag to the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If no unit tests are added, please explain the reason in this PR.
  • Provide accuracy results.
  • If the current PR targets the release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings December 15, 2025 11:33
paddle-bot commented Dec 15, 2025

Thanks for your contribution!

freeliuzc force-pushed the merge_fix_entropy_dev branch from f7dab52 to 67defda on December 15, 2025 11:36
Copilot AI (Contributor) left a comment
Pull request overview

This pull request adds support for per-token inference seeds in speculative decoding by modifying the seed increment logic and seed parameter handling in the sampling process.

  • Adjusts infer_seed_increment calculation in gpu_model_runner.py to account for speculative token counts
  • Updates padding_sampling_params to generate per-token seeds with proper offsets
  • Changes SpeculativeSampler to use per-token seeds instead of a single global seed
  • Modifies MTPSampler to use greedy decoding (argmax) instead of stochastic sampling

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File | Description
fastdeploy/worker/gpu_model_runner.py | Updates infer_seed_increment calculation to scale with speculative token count
fastdeploy/model_executor/layers/sample/sampler.py | Adds per-token seed generation logic and updates sampling methods to use token-specific seeds
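
A hypothetical sketch of what "scale with speculative token count" means for the seed stepping (the names num_speculative_tokens, max_num_seqs, and the share_inputs layout are assumptions, not the actual gpu_model_runner.py code): since token j of a request uses base_seed + 4 * j within one step, the base seed has to advance past the largest per-token offset before the next step.

import paddle

MAX_INFER_SEED = 9223372036854775807  # placeholder for the real constant

num_speculative_tokens = 2   # assumed: tokens a request may emit per step under speculative decoding
max_num_seqs = 64            # assumed batch capacity
step_increment = 4 * num_speculative_tokens  # jump past the per-token offsets used within one step

share_inputs = {"infer_seed": paddle.zeros([max_num_seqs, 1], dtype="int64")}
infer_seed_increment = paddle.full([max_num_seqs, 1], step_increment, dtype="int64")
share_inputs["infer_seed"] = (share_inputs["infer_seed"] + infer_seed_increment) % MAX_INFER_SEED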

top_p_padding = paddle.repeat_interleave(top_p[:real_bsz], repeats).unsqueeze(1)
top_k_padding = paddle.repeat_interleave(top_k[:real_bsz], repeats).unsqueeze(1)
return top_p_padding, top_k_padding
topp_seed = paddle.repeat_interleave(infer_seed[:real_bsz], repeats).unsqueeze(1)
Copilot AI commented Dec 15, 2025:
The variable name 'topp_seed' is inconsistent with the existing codebase naming convention. The parameter name in top_k_top_p_sampling function is 'topp_seed' (lowercase), but this could be confused with 'top_p_seed'. Consider using a clearer name like 'infer_seed_padded' or 'per_token_seed' to better reflect that this contains per-token seed values after padding and offset calculations.

Comment on lines 68 to 80
idx = 0
for i in range(real_bsz):
    if seq_lens_encoder[i] == 0:
        seq_len_this_time = seq_lens_this_time[i]

        offsets = 4 * paddle.arange(seq_len_this_time, dtype=topp_seed.dtype)
        topp_seed[idx : idx + seq_len_this_time, 0] = (
            topp_seed[idx : idx + seq_len_this_time, 0] + offsets
        ) % MAX_INFER_SEED

        idx += seq_len_this_time
    else:
        idx += 1
Copilot AI commented Dec 15, 2025:
The Python for loop iterating over real_bsz could be a performance bottleneck for large batch sizes. Consider vectorizing this operation using Paddle tensor operations to avoid the sequential processing, especially since the modulo operation and offset calculation could potentially be batched.

Suggested change

Remove:

idx = 0
for i in range(real_bsz):
    if seq_lens_encoder[i] == 0:
        seq_len_this_time = seq_lens_this_time[i]
        offsets = 4 * paddle.arange(seq_len_this_time, dtype=topp_seed.dtype)
        topp_seed[idx : idx + seq_len_this_time, 0] = (
            topp_seed[idx : idx + seq_len_this_time, 0] + offsets
        ) % MAX_INFER_SEED
        idx += seq_len_this_time
    else:
        idx += 1

Add:

# Vectorized version of the loop for performance
mask = (seq_lens_encoder[:real_bsz] == 0)
seq_lens = seq_lens_this_time
# Compute the start indices for each sequence in the repeated tensors
# For each batch, the start index is the sum of previous sequence lengths
start_indices = paddle.cumsum(
    paddle.concat([paddle.zeros([1], dtype=seq_lens.dtype), seq_lens[:-1]]), axis=0
)
# Only process the indices where mask is True
masked_indices = paddle.nonzero(mask).flatten()
for i in masked_indices.numpy().tolist():
    seq_len = seq_lens[i].item()
    start = start_indices[i].item()
    offsets = 4 * paddle.arange(seq_len, dtype=topp_seed.dtype)
    topp_seed[start : start + seq_len, 0] = (
        topp_seed[start : start + seq_len, 0] + offsets
    ) % MAX_INFER_SEED

freeliuzc (Collaborator, Author) replied:
TODO: will optimize this in the next PR.
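
For reference, one possible vectorized formulation of that loop (a sketch only, assuming 1-D int tensors and that the padded tensor has exactly sum(run_lens) rows; the follow-up that actually landed is the "optimize padding_sampling_params speed(vectorized)" commit referenced at the bottom of this thread):

import paddle

MAX_INFER_SEED = 9223372036854775807  # placeholder for the real constant

def apply_seed_offsets_vectorized(topp_seed, seq_lens_this_time, seq_lens_encoder, real_bsz):
    """Apply the same 4 * position offsets as the Python loop without iterating over the batch."""
    # Rows each request occupies in the padded tensor: seq_len for decode, 1 for prefill.
    run_lens = paddle.where(
        seq_lens_encoder[:real_bsz] == 0,
        seq_lens_this_time[:real_bsz],
        paddle.ones_like(seq_lens_this_time[:real_bsz]),
    )
    starts = paddle.cumsum(run_lens) - run_lens  # first padded row of each request
    row_idx = paddle.arange(int(run_lens.sum()), dtype=run_lens.dtype)
    # Position of every padded row inside its request; prefill runs have length 1, so their offset stays 0.
    pos_in_run = row_idx - paddle.repeat_interleave(starts, run_lens)
    topp_seed[:, 0] = (topp_seed[:, 0] + 4 * pos_in_run.astype(topp_seed.dtype)) % MAX_INFER_SEED
    return topp_seed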

share_inputs["seq_lens_encoder"],
)
_, next_tokens = top_k_top_p_sampling(probs, top_p=top_p, top_k=top_k, seed=sampling_metadata.seed[0, 0])
next_tokens = paddle.argmax(probs, axis=-1)
Copilot AI commented Dec 15, 2025:
Replacing the sampling method with paddle.argmax effectively changes the sampling strategy from stochastic (top-k/top-p sampling) to deterministic (greedy decoding). This is a significant behavioral change for the MTPSampler's forward_cuda method that could break existing functionality. If deterministic sampling is intended for this specific case, it should be documented and validated; otherwise, the top-k/top-p sampling logic should be retained with proper seed handling similar to the SpeculativeSampler implementation.

Suggested change

Remove:

next_tokens = paddle.argmax(probs, axis=-1)

Add:

_, next_tokens = top_k_top_p_sampling(
    probs, sampling_metadata.top_p, sampling_metadata.top_k, sampling_metadata.top_k_list
)

codecov-commenter commented Dec 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@21fa2ba).

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #5568   +/-   ##
==========================================
  Coverage           ?   62.15%           
==========================================
  Files              ?      329           
  Lines              ?    41562           
  Branches           ?     6345           
==========================================
  Hits               ?    25834           
  Misses             ?    13791           
  Partials           ?     1937           
Flag Coverage Δ
GPU 62.15% <100.00%> (?)

PaddlePaddle deleted 4 comments from Copilot AI on Dec 16, 2025
share_inputs["seq_lens_encoder"],
)
_, next_tokens = top_k_top_p_sampling(probs, top_p=top_p, top_k=top_k, seed=sampling_metadata.seed[0, 0])
next_tokens = paddle.argmax(probs, axis=-1)
A collaborator commented:
Is this dropping MTP's top_p_top_k_sampling entirely here?

freeliuzc (Collaborator, Author) replied:
This is fine in all current scenarios, so we are landing it this way first for the performance gain; this part will be refactored in a few days.

Deleter-D previously approved these changes Dec 16, 2025
freeliuzc merged commit 15f5112 into PaddlePaddle:develop Dec 17, 2025
15 of 18 checks passed
freeliuzc added a commit that referenced this pull request Dec 17, 2025
) (#5597)

* fix mtp entropy drop in RL

* optimize usage and fix unit test

* optimize padding_sampling_params speed(vectorized)
