Conversation

@freeliuzc
Collaborator

Motivation

  1. Currently, FD's design has the entire batch share a single seed (an int) during inference, used as the random number seed for sampling; it is incremented once per step. In non-speculative decoding, the tokens produced for each query therefore come from different seeds.
  2. In speculative decoding, each step may produce two draft tokens, so effectively every two tokens share one seed; this reduces diversity by roughly 10%.

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

  1. Change seed from an int scalar to an int* tensor so that every token in the batch gets a distinct seed (see the sketch after this list)
  2. [Speculative Decoding]Support different inferseed in speculate decoding #5568
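
A minimal sketch of the per-token seed idea (not the actual FastDeploy implementation; the helper name expand_seeds_per_token and the step size of 4 are illustrative assumptions):

import paddle

# Wrap-around bound for seeds, mirroring the value used in the PR
MAX_INFER_SEED = 9223372036854775806

def expand_seeds_per_token(infer_seed, seq_lens_this_time, step=4):
    # Repeat each sequence's seed once per token produced in this step
    per_token = paddle.repeat_interleave(infer_seed, seq_lens_this_time)
    # Offset successive positions so draft tokens from the same step draw from different seeds
    offsets = paddle.concat(
        [step * paddle.arange(int(n), dtype=per_token.dtype) for n in seq_lens_this_time]
    )
    return (per_token + offsets) % MAX_INFER_SEED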

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If no unit tests are added, please explain the reason in this PR.
  • Provide accuracy results.
  • If the current PR is being submitted to the release branch, make sure it has already been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings December 16, 2025 11:41
@paddle-bot

paddle-bot bot commented Dec 16, 2025

Thanks for your contribution!

Contributor

Copilot AI left a comment


Pull request overview

This PR is a cherry-pick from #5568 that addresses a seed diversity issue in speculative decoding. The changes modify the seed handling from a single shared integer to per-token seed tensors, resolving a ~10% diversity reduction that occurred when draft tokens shared the same seed value.

Key Changes:

  • Modified seed representation from scalar to tensor to provide unique seeds per token in a batch
  • Updated padding_sampling_params to handle seed offset calculation for decoder sequences
  • Adjusted seed increment values in GPU model runner to account for speculative decoding token counts
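
As a rough illustration of the last point, a hedged sketch of the per-step seed advance (names such as share_inputs["infer_seed"], speculative_decoding, and speculative_config.num_speculative_tokens are assumptions, not the exact code in gpu_model_runner.py):

# In speculative mode a step may emit up to (num_speculative_tokens + 1) tokens,
# so the seed must advance by that many positions per step to keep every token's seed distinct.
step = speculative_config.num_speculative_tokens + 1 if speculative_decoding else 1
share_inputs["infer_seed"][:] = (share_inputs["infer_seed"] + step) % MAX_INFER_SEED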

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
  • tests/layers/test_speculative_sampler.py: Updated test seed initialization and added new unit tests for the padding functionality with seed offsets
  • fastdeploy/worker/gpu_model_runner.py: Modified seed increment calculation to account for the speculative token count in speculative decoding mode
  • fastdeploy/model_executor/layers/sample/sampler.py: Extended padding_sampling_params to handle per-token seed generation with offsets, updated sampling calls to use tensor seeds, and changed MTPSampler to use argmax
Comments suppressed due to low confidence (1)

tests/layers/test_speculative_sampler.py:282

  • The newly added test functions test_padding_sampling_params_basic and test_padding_sampling_params_seed_offset are not being called in the __main__ block. These tests should be added to ensure the new padding functionality is properly tested when the test file is run directly.
if __name__ == "__main__":
    test_speculative_sampler()
    test_speculative_sampler_logprobs()
    test_mtp_sampler()
    test_mtp_sampler_logprobs()
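
A minimal sketch of that block with the missing calls added, assuming the two test names quoted in the comment above:

if __name__ == "__main__":
    test_speculative_sampler()
    test_speculative_sampler_logprobs()
    test_mtp_sampler()
    test_mtp_sampler_logprobs()
    # Newly added padding tests (names taken from the review comment above)
    test_padding_sampling_params_basic()
    test_padding_sampling_params_seed_offset()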

Comment on lines 68 to 80
    idx = 0
    for i in range(real_bsz):
        if seq_lens_encoder[i] == 0:
            seq_len_this_time = seq_lens_this_time[i]

            offsets = 4 * paddle.arange(seq_len_this_time, dtype=topp_seed.dtype)
            topp_seed[idx : idx + seq_len_this_time, 0] = (
                topp_seed[idx : idx + seq_len_this_time, 0] + offsets
            ) % MAX_INFER_SEED

            idx += seq_len_this_time
        else:
            idx += 1

Copilot AI Dec 16, 2025


The seed offset calculation uses a Python for loop which may be inefficient for large batch sizes. Consider vectorizing this operation using paddle operations to avoid the Python loop overhead. For example, you could construct a mask tensor for decoder sequences and apply offsets in a vectorized manner.

Suggested change
    idx = 0
    for i in range(real_bsz):
        if seq_lens_encoder[i] == 0:
            seq_len_this_time = seq_lens_this_time[i]
            offsets = 4 * paddle.arange(seq_len_this_time, dtype=topp_seed.dtype)
            topp_seed[idx : idx + seq_len_this_time, 0] = (
                topp_seed[idx : idx + seq_len_this_time, 0] + offsets
            ) % MAX_INFER_SEED
            idx += seq_len_this_time
        else:
            idx += 1

    # Vectorized implementation for seed offset calculation
    # Create a mask for decoder sequences (where seq_lens_encoder == 0)
    mask = seq_lens_encoder[:real_bsz] == 0
    # Get the indices of decoder sequences
    decoder_indices = paddle.nonzero(mask).flatten()
    # For each decoder sequence, calculate the start and end indices in topp_seed
    seq_lens = seq_lens_this_time[decoder_indices]
    # Compute the start indices for each decoder sequence in topp_seed
    # The repeats array tells us how many rows each batch element occupies in topp_seed
    repeats_np = repeats.numpy()
    start_indices = []
    idx = 0
    for r in repeats_np:
        start_indices.append(idx)
        idx += int(r)
    start_indices = paddle.to_tensor(start_indices, dtype='int64')
    decoder_start_indices = paddle.gather(start_indices, decoder_indices)
    # For each decoder sequence, generate the offsets and apply them
    for i in range(decoder_indices.shape[0]):
        start = int(decoder_start_indices[i].item())
        length = int(seq_lens[i].item())
        offsets = 4 * paddle.arange(length, dtype=topp_seed.dtype)
        topp_seed[start : start + length, 0] = (
            topp_seed[start : start + length, 0] + offsets
        ) % MAX_INFER_SEED
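
Note that the suggestion above still iterates over batch elements in Python. A fully vectorized sketch, under the assumption that repeats, real_bsz, seq_lens_encoder, topp_seed, and MAX_INFER_SEED behave as in the surrounding function (not the code actually merged), could look like:

    repeats64 = paddle.cast(repeats, "int64")
    total = int(repeats64.sum())
    # Row index of the batch element that owns each padded token row
    row_ids = paddle.repeat_interleave(paddle.arange(real_bsz, dtype="int64"), repeats64)
    # Position of each token within its own sequence: 0, 1, ..., len - 1
    starts = paddle.cumsum(repeats64) - repeats64
    pos_in_seq = paddle.arange(total, dtype="int64") - paddle.repeat_interleave(starts, repeats64)
    # Only decoder sequences (seq_lens_encoder == 0) receive the 4 * k offset
    is_decoder = paddle.cast(seq_lens_encoder[:real_bsz] == 0, topp_seed.dtype)
    offsets = 4 * paddle.cast(pos_in_seq, topp_seed.dtype) * paddle.gather(is_decoder, row_ids)
    topp_seed[:, 0] = (topp_seed[:, 0] + offsets) % MAX_INFER_SEED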



def padding_sampling_params(top_p, top_k, seq_lens_this_time, seq_lens_encoder):
def padding_sampling_params(top_p, top_k, infer_seed, seq_lens_this_time, seq_lens_encoder):

Copilot AI Dec 16, 2025


The function padding_sampling_params lacks documentation explaining its purpose, parameters, and return values. This is especially important given the complex logic for seed offset calculation. Consider adding a docstring that explains: the purpose of the function in the context of speculative decoding, what each parameter represents, the seed offset calculation logic (e.g., why decoder sequences get 4*k offsets), and what the three return values represent.

Suggested change
def padding_sampling_params(top_p, top_k, infer_seed, seq_lens_this_time, seq_lens_encoder):
def padding_sampling_params(top_p, top_k, infer_seed, seq_lens_this_time, seq_lens_encoder):
    """
    Prepare and pad sampling parameters for each token position in a batch, supporting speculative decoding.

    This function expands the per-sequence sampling parameters (top_p, top_k, infer_seed) to per-token parameters,
    taking into account whether each sequence is a decoder sequence (seq_lens_encoder == 0) or not.
    For decoder sequences, the function offsets the random seed for each token position by 4*k, where k is the
    token index, to ensure different random seeds for each token during speculative decoding. This helps avoid
    repeated sampling results when generating multiple tokens in a single forward pass.

    Args:
        top_p (paddle.Tensor): Tensor of shape [batch_size] containing top-p sampling values for each sequence.
        top_k (paddle.Tensor): Tensor of shape [batch_size] containing top-k sampling values for each sequence.
        infer_seed (paddle.Tensor): Tensor of shape [batch_size] containing the random seed for each sequence.
        seq_lens_this_time (paddle.Tensor): Tensor of shape [batch_size] indicating the number of tokens to generate for each sequence in this step.
        seq_lens_encoder (paddle.Tensor): Tensor of shape [batch_size] indicating the encoder sequence length for each sequence (0 for decoder-only).

    Returns:
        top_p_padding (paddle.Tensor): Padded top-p values, shape [sum(seq_lens_this_time), 1], one per token position.
        top_k_padding (paddle.Tensor): Padded top-k values, shape [sum(seq_lens_this_time), 1], one per token position.
        topp_seed (paddle.Tensor): Padded and offset random seeds, shape [sum(seq_lens_this_time), 1], one per token position.
    """

        if seq_lens_encoder[i] == 0:
            seq_len_this_time = seq_lens_this_time[i]

            offsets = 4 * paddle.arange(seq_len_this_time, dtype=topp_seed.dtype)

Copilot AI Dec 16, 2025


The offset calculation uses a hardcoded multiplier of 4, which should be documented or extracted as a named constant. This magic number appears related to the seed increment logic but the relationship is not clear from the code. Consider defining this as a constant with a descriptive name (e.g., SEED_OFFSET_STEP) to improve code clarity and maintainability.
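
For example, a possible naming sketch (the constant name SPECULATIVE_SEED_OFFSET_STEP and its placement are hypothetical, not part of the PR):

# Seed positions consumed per token within one speculative decoding step
SPECULATIVE_SEED_OFFSET_STEP = 4

offsets = SPECULATIVE_SEED_OFFSET_STEP * paddle.arange(seq_len_this_time, dtype=topp_seed.dtype)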

            share_inputs["seq_lens_encoder"],
        )
        _, next_tokens = top_k_top_p_sampling(probs, top_p=top_p, top_k=top_k, seed=sampling_metadata.seed[0, 0])
        next_tokens = paddle.argmax(probs, axis=-1)

Copilot AI Dec 16, 2025


The MTPSampler forward_cuda method has been changed from sampling (top_k_top_p_sampling) to deterministic selection (argmax). This is a significant behavior change that replaces probabilistic sampling with greedy decoding. This change appears to be unrelated to the seed diversity improvements and may negatively impact model output quality and diversity. If this change is intentional, it should be documented and explained in the PR description.

Suggested change
        next_tokens = paddle.argmax(probs, axis=-1)
        _, next_tokens = top_k_top_p_sampling(
            probs, sampling_metadata.top_p, sampling_metadata.top_k, sampling_metadata.top_k_list
        )

    return top_p_padding, top_k_padding
    topp_seed = paddle.repeat_interleave(infer_seed[:real_bsz], repeats).unsqueeze(1)

    MAX_INFER_SEED = 9223372036854775806

Copilot AI Dec 16, 2025


The magic number 9223372036854775806 (which is 2^63 - 2) should be defined as a named constant with an explanatory comment, ideally at the module level or as a class constant. This would improve code readability and maintainability. Consider adding a comment explaining why this specific value is chosen (e.g., maximum value for int64 with safety margin).
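
For example, a possible module-level definition (placement and comment wording are illustrative assumptions):

# Largest seed value before wrap-around: int64 max (2**63 - 1) minus a one-value safety margin
MAX_INFER_SEED = 2**63 - 2  # == 9223372036854775806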

Deleter-D previously approved these changes Dec 16, 2025
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release/2.4@c19af49). Learn more about missing BASE report.

Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.4    #5597   +/-   ##
==============================================
  Coverage               ?   59.00%           
==============================================
  Files                  ?      327           
  Lines                  ?    40672           
  Branches               ?     6177           
==============================================
  Hits                   ?    23999           
  Misses                 ?    14811           
  Partials               ?     1862           
Flag Coverage Δ
GPU 59.00% <100.00%> (?)


☔ View full report in Codecov by Sentry.

@freeliuzc freeliuzc merged commit a7359d1 into PaddlePaddle:release/2.4 Dec 17, 2025
14 of 16 checks passed