[Cherry-Pick][CI]Support different inferseed in speculate decoding(#5568) #5597
Conversation
Thanks for your contribution!
Pull request overview
This PR is a cherry-pick from #5568 that addresses a seed diversity issue in speculative decoding. The changes modify the seed handling from a single shared integer to per-token seed tensors, resolving a ~10% diversity reduction that occurred when draft tokens shared the same seed value.
Key Changes:
- Modified seed representation from scalar to tensor to provide unique seeds per token in a batch (see the toy sketch below)
- Updated `padding_sampling_params` to handle seed offset calculation for decoder sequences
- Adjusted seed increment values in the GPU model runner to account for speculative decoding token counts
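A toy illustration of the diversity issue the overview describes; this is not FastDeploy code (`paddle.multinomial` stands in here for the project's `top_k_top_p_sampling` op, and the step of 4 mirrors the per-token offsets introduced by this PR):

```python
import paddle

# Three draft-token positions sampling from the same toy distribution.
probs = paddle.to_tensor([0.5, 0.3, 0.2])

# Shared scalar seed: every position restarts the RNG from the same state,
# so all three positions draw the same token.
shared = []
for _ in range(3):
    paddle.seed(42)
    shared.append(int(paddle.multinomial(probs, num_samples=1)))

# Per-token seeds (42, 46, 50): each position starts from a different RNG
# state, so the draws are decorrelated.
per_token = []
for k in range(3):
    paddle.seed(42 + 4 * k)
    per_token.append(int(paddle.multinomial(probs, num_samples=1)))

print(shared)     # always three identical tokens
print(per_token)  # tokens are free to differ
```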
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/layers/test_speculative_sampler.py | Updated test seed initialization and added new unit tests for the padding functionality with seed offsets |
| fastdeploy/worker/gpu_model_runner.py | Modified seed increment calculation to account for speculative token count in speculative decoding mode |
| fastdeploy/model_executor/layers/sample/sampler.py | Extended padding_sampling_params to handle per-token seed generation with offsets, updated sampling calls to use tensor seeds, and changed MTPSampler to use argmax |
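The `gpu_model_runner.py` row above describes adjusting the seed increment for speculative decoding. A hedged sketch of what that bookkeeping could look like, where the identifiers `next_infer_seed` and `num_speculative_tokens` are illustrative rather than the file's actual names, and the real calculation may differ:

```python
# Each step consumes one seed per sampled position (target plus draft
# tokens), spaced SEED_OFFSET_STEP apart, so the base seed must advance
# past the whole stride before the next step.
SEED_OFFSET_STEP = 4
MAX_INFER_SEED = 9223372036854775806  # 2**63 - 2, as in sampler.py

def next_infer_seed(infer_seed: int, num_speculative_tokens: int) -> int:
    increment = SEED_OFFSET_STEP * (1 + num_speculative_tokens)
    return (infer_seed + increment) % MAX_INFER_SEED
```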
Comments suppressed due to low confidence (1)
tests/layers/test_speculative_sampler.py:282
- The newly added test functions `test_padding_sampling_params_basic` and `test_padding_sampling_params_seed_offset` are not being called in the `__main__` block. These tests should be added to ensure the new padding functionality is properly tested when the test file is run directly.
```python
if __name__ == "__main__":
    test_speculative_sampler()
    test_speculative_sampler_logprobs()
    test_mtp_sampler()
    test_mtp_sampler_logprobs()
```
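The fix the comment asks for is straightforward (a sketch; the two added calls are exactly the functions named above):

```python
if __name__ == "__main__":
    test_speculative_sampler()
    test_speculative_sampler_logprobs()
    test_mtp_sampler()
    test_mtp_sampler_logprobs()
    # Run the newly added padding tests under direct invocation as well:
    test_padding_sampling_params_basic()
    test_padding_sampling_params_seed_offset()
```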
The hunk under review (from `fastdeploy/model_executor/layers/sample/sampler.py`):

```python
idx = 0
for i in range(real_bsz):
    if seq_lens_encoder[i] == 0:
        seq_len_this_time = seq_lens_this_time[i]
        offsets = 4 * paddle.arange(seq_len_this_time, dtype=topp_seed.dtype)
        topp_seed[idx : idx + seq_len_this_time, 0] = (
            topp_seed[idx : idx + seq_len_this_time, 0] + offsets
        ) % MAX_INFER_SEED
        idx += seq_len_this_time
    else:
        idx += 1
```
Copilot AI · Dec 16, 2025
The seed offset calculation uses a Python for loop which may be inefficient for large batch sizes. Consider vectorizing this operation using paddle operations to avoid the Python loop overhead. For example, you could construct a mask tensor for decoder sequences and apply offsets in a vectorized manner.
Suggested change (replacing the loop shown above):

```python
# Vectorized implementation for seed offset calculation
# Create a mask for decoder sequences (where seq_lens_encoder == 0)
mask = seq_lens_encoder[:real_bsz] == 0
# Get the indices of decoder sequences
decoder_indices = paddle.nonzero(mask).flatten()
# For each decoder sequence, calculate the start and end indices in topp_seed
seq_lens = seq_lens_this_time[decoder_indices]
# Compute the start indices for each decoder sequence in topp_seed.
# The repeats array tells us how many rows each batch element occupies in topp_seed.
repeats_np = repeats.numpy()
start_indices = []
idx = 0
for r in repeats_np:
    start_indices.append(idx)
    idx += int(r)
start_indices = paddle.to_tensor(start_indices, dtype="int64")
decoder_start_indices = paddle.gather(start_indices, decoder_indices)
# For each decoder sequence, generate the offsets and apply them
for i in range(decoder_indices.shape[0]):
    start = int(decoder_start_indices[i].item())
    length = int(seq_lens[i].item())
    offsets = 4 * paddle.arange(length, dtype=topp_seed.dtype)
    topp_seed[start : start + length, 0] = (
        topp_seed[start : start + length, 0] + offsets
    ) % MAX_INFER_SEED
```
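Note that the suggestion above still loops in Python, once over `repeats` and once over the decoder sequences. A loop-free variant is possible; here is a hedged sketch, assuming `repeats` holds the per-batch row counts used to build `topp_seed` and that `seq_lens_encoder` flattens to one entry per batch element (the actual shapes in sampler.py may differ):

```python
import paddle

repeats64 = repeats.astype("int64")
# Batch index of every row in topp_seed.
batch_ids = paddle.repeat_interleave(paddle.arange(real_bsz, dtype="int64"), repeats64)
# First row of each batch element's segment (exclusive prefix sum of repeats).
starts = paddle.concat([paddle.zeros([1], dtype="int64"), paddle.cumsum(repeats64)[:-1]])
# Position of each row within its segment: 0, 1, 2, ...
within = paddle.arange(batch_ids.shape[0], dtype=topp_seed.dtype) - paddle.gather(
    starts, batch_ids
).astype(topp_seed.dtype)
# Apply the 4*k offset only to decoder rows (seq_lens_encoder == 0).
is_decoder = (
    paddle.gather(seq_lens_encoder.flatten()[:real_bsz], batch_ids) == 0
).astype(topp_seed.dtype)
topp_seed[:, 0] = (topp_seed[:, 0] + 4 * within * is_decoder) % MAX_INFER_SEED
```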
```diff
-def padding_sampling_params(top_p, top_k, seq_lens_this_time, seq_lens_encoder):
+def padding_sampling_params(top_p, top_k, infer_seed, seq_lens_this_time, seq_lens_encoder):
```
Copilot AI · Dec 16, 2025
The function `padding_sampling_params` lacks documentation explaining its purpose, parameters, and return values. This is especially important given the complex logic for seed offset calculation. Consider adding a docstring that explains: the purpose of the function in the context of speculative decoding, what each parameter represents, the seed offset calculation logic (e.g., why decoder sequences get `4*k` offsets), and what the three return values represent.
Suggested change:

```python
def padding_sampling_params(top_p, top_k, infer_seed, seq_lens_this_time, seq_lens_encoder):
    """
    Prepare and pad sampling parameters for each token position in a batch,
    supporting speculative decoding.

    This function expands the per-sequence sampling parameters (top_p, top_k,
    infer_seed) to per-token parameters, taking into account whether each
    sequence is a decoder sequence (seq_lens_encoder == 0) or not. For decoder
    sequences, the random seed for each token position is offset by 4*k, where
    k is the token index, to ensure different random seeds for each token
    during speculative decoding. This helps avoid repeated sampling results
    when generating multiple tokens in a single forward pass.

    Args:
        top_p (paddle.Tensor): Shape [batch_size]; top-p sampling value for each sequence.
        top_k (paddle.Tensor): Shape [batch_size]; top-k sampling value for each sequence.
        infer_seed (paddle.Tensor): Shape [batch_size]; random seed for each sequence.
        seq_lens_this_time (paddle.Tensor): Shape [batch_size]; number of tokens to
            generate for each sequence in this step.
        seq_lens_encoder (paddle.Tensor): Shape [batch_size]; encoder sequence length
            for each sequence (0 for decoder-only).

    Returns:
        top_p_padding (paddle.Tensor): Padded top-p values, shape
            [sum(seq_lens_this_time), 1], one per token position.
        top_k_padding (paddle.Tensor): Padded top-k values, same shape, one per
            token position.
        topp_seed (paddle.Tensor): Padded and offset random seeds, same shape,
            one per token position.
    """
```
```python
if seq_lens_encoder[i] == 0:
    seq_len_this_time = seq_lens_this_time[i]
    offsets = 4 * paddle.arange(seq_len_this_time, dtype=topp_seed.dtype)
```
Copilot AI · Dec 16, 2025
The offset calculation uses a hardcoded multiplier of 4, which should be documented or extracted as a named constant. This magic number appears related to the seed increment logic, but the relationship is not clear from the code. Consider defining it as a constant with a descriptive name (e.g., `SEED_OFFSET_STEP`) to improve code clarity and maintainability.
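Such a refactor might look like this (the name `SEED_OFFSET_STEP` is the reviewer's example, not an identifier from the codebase):

```python
# Stride between consecutive per-token seeds; must stay in sync with the
# per-step seed increment applied in gpu_model_runner.py.
SEED_OFFSET_STEP = 4

offsets = SEED_OFFSET_STEP * paddle.arange(seq_len_this_time, dtype=topp_seed.dtype)
```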
```diff
     share_inputs["seq_lens_encoder"],
 )
-_, next_tokens = top_k_top_p_sampling(probs, top_p=top_p, top_k=top_k, seed=sampling_metadata.seed[0, 0])
+next_tokens = paddle.argmax(probs, axis=-1)
```
Copilot AI · Dec 16, 2025
The `MTPSampler.forward_cuda` method has been changed from sampling (`top_k_top_p_sampling`) to deterministic selection (`argmax`). This is a significant behavior change that replaces probabilistic sampling with greedy decoding. It appears unrelated to the seed diversity improvements and may negatively impact model output quality and diversity. If this change is intentional, it should be documented and explained in the PR description.
Suggested change:

```diff
-next_tokens = paddle.argmax(probs, axis=-1)
+_, next_tokens = top_k_top_p_sampling(
+    probs, sampling_metadata.top_p, sampling_metadata.top_k, sampling_metadata.top_k_list
+)
```
```python
return top_p_padding, top_k_padding
topp_seed = paddle.repeat_interleave(infer_seed[:real_bsz], repeats).unsqueeze(1)
MAX_INFER_SEED = 9223372036854775806
```
Copilot AI · Dec 16, 2025
The magic number 9223372036854775806 (which is 2^63 - 2) should be defined as a named constant with an explanatory comment, ideally at the module level or as a class constant. This would improve code readability and maintainability. Consider adding a comment explaining why this specific value is chosen (e.g., maximum value for int64 with safety margin).
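Something along these lines would address the comment (a sketch; the module-level placement and the rationale in the comment follow the reviewer's suggestion):

```python
# Maximum inference seed: 2**63 - 2, one below the int64 maximum, leaving a
# safety margin so seed arithmetic stays within int64 before the modulo.
MAX_INFER_SEED = 9223372036854775806
```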
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##           release/2.4   #5597   +/-  ##
==============================================
  Coverage             ?   59.00%
==============================================
  Files                ?      327
  Lines                ?    40672
  Branches             ?     6177
==============================================
  Hits                 ?    23999
  Misses               ?    14811
  Partials             ?     1862
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- PR title tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- Run `pre-commit` before commit.
- For a PR to a `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.