[Speculative Decoding]Support different inferseed in speculate decoding #5568
Conversation
Thanks for your contribution!
Force-pushed from f7dab52 to 67defda.
Pull request overview
This pull request adds support for different inference seeds in speculative decoding by modifying the seed increment logic and seed parameter handling in the sampling process.
- Adjusts the infer_seed_increment calculation in gpu_model_runner.py to account for speculative token counts
- Updates padding_sampling_params to generate per-token seeds with proper offsets (see the sketch after this list)
- Changes SpeculativeSampler to use per-token seeds instead of a single global seed
- Modifies MTPSampler to use greedy decoding (argmax) instead of stochastic sampling
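For orientation, here is a minimal sketch of the per-token seed idea. The helper name per_token_seeds and the MAX_INFER_SEED value are assumptions for illustration only, not the actual FastDeploy code:

```python
import paddle

MAX_INFER_SEED = 9223372036854775806  # assumption: a large int64 modulus, not the real constant

def per_token_seeds(request_seed: int, num_tokens: int) -> paddle.Tensor:
    # Each speculative token position gets its own seed: the request's base seed
    # plus a fixed-stride offset (stride 4, as in the PR's loop), wrapped at MAX_INFER_SEED.
    offsets = 4 * paddle.arange(num_tokens, dtype="int64")
    return (request_seed + offsets) % MAX_INFER_SEED

# A request with base seed 123 and 3 draft tokens gets seeds [123, 127, 131],
# so each token's top-p/top-k draw uses a different seed.
print(per_token_seeds(123, 3))
```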
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| fastdeploy/worker/gpu_model_runner.py | Updates infer_seed_increment calculation to scale with speculative token count |
| fastdeploy/model_executor/layers/sample/sampler.py | Adds per-token seed generation logic and updates sampling methods to use token-specific seeds |
```diff
 top_p_padding = paddle.repeat_interleave(top_p[:real_bsz], repeats).unsqueeze(1)
 top_k_padding = paddle.repeat_interleave(top_k[:real_bsz], repeats).unsqueeze(1)
-return top_p_padding, top_k_padding
+topp_seed = paddle.repeat_interleave(infer_seed[:real_bsz], repeats).unsqueeze(1)
```
Copilot AI commented on Dec 15, 2025:
The variable name 'topp_seed' is inconsistent with the existing codebase naming convention. The parameter name in the top_k_top_p_sampling function is 'topp_seed' (lowercase), but this could be confused with 'top_p_seed'. Consider using a clearer name like 'infer_seed_padded' or 'per_token_seed' to better reflect that this tensor contains per-token seed values after padding and offset calculations.
```python
idx = 0
for i in range(real_bsz):
    if seq_lens_encoder[i] == 0:
        seq_len_this_time = seq_lens_this_time[i]

        offsets = 4 * paddle.arange(seq_len_this_time, dtype=topp_seed.dtype)
        topp_seed[idx : idx + seq_len_this_time, 0] = (
            topp_seed[idx : idx + seq_len_this_time, 0] + offsets
        ) % MAX_INFER_SEED

        idx += seq_len_this_time
    else:
        idx += 1
```
Copilot AI commented on Dec 15, 2025:
The Python for loop iterating over real_bsz could be a performance bottleneck for large batch sizes. Consider vectorizing this operation using Paddle tensor operations to avoid the sequential processing, especially since the modulo operation and offset calculation could potentially be batched.
Suggested change (replacing the loop above):

```python
# Vectorized version of the loop for performance
mask = (seq_lens_encoder[:real_bsz] == 0)
seq_lens = seq_lens_this_time
# Compute the start indices for each sequence in the repeated tensors
# For each batch, the start index is the sum of previous sequence lengths
start_indices = paddle.cumsum(
    paddle.concat([paddle.zeros([1], dtype=seq_lens.dtype), seq_lens[:-1]]), axis=0
)
# Only process the indices where mask is True
masked_indices = paddle.nonzero(mask).flatten()
for i in masked_indices.numpy().tolist():
    seq_len = seq_lens[i].item()
    start = start_indices[i].item()
    offsets = 4 * paddle.arange(seq_len, dtype=topp_seed.dtype)
    topp_seed[start : start + seq_len, 0] = (
        topp_seed[start : start + seq_len, 0] + offsets
    ) % MAX_INFER_SEED
```
TODO: will optimize this in the next PR.
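For reference, the offset computation can be vectorized without any Python loop. The sketch below is only an illustration, assuming repeats holds seq_lens_this_time for decode requests and 1 for prefill requests (matching the repeat_interleave padding above); the function and variable names are hypothetical, not FastDeploy's API:

```python
import paddle

def add_per_token_offsets(topp_seed, seq_lens_encoder, repeats, real_bsz, max_infer_seed):
    # Hypothetical vectorized sketch of the loop above, not the actual implementation.
    rep = repeats[:real_bsz].flatten().astype("int64")
    enc = seq_lens_encoder[:real_bsz].flatten()
    # First padded row of each request, and the request index of every padded row.
    group_starts = paddle.cumsum(rep, axis=0) - rep
    row_ids = paddle.repeat_interleave(paddle.arange(real_bsz), rep)
    # Position of each padded row inside its request's group: 0, 1, 2, ...
    pos_in_group = paddle.arange(int(rep.sum())) - paddle.gather(group_starts, row_ids)
    # Offsets apply only to decode requests (seq_lens_encoder == 0); prefill rows stay at 0.
    is_decode = paddle.gather((enc == 0).astype("int64"), row_ids)
    offsets = (4 * pos_in_group * is_decode).unsqueeze(1)
    return (topp_seed + offsets) % max_infer_seed
```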
```diff
     share_inputs["seq_lens_encoder"],
 )
-_, next_tokens = top_k_top_p_sampling(probs, top_p=top_p, top_k=top_k, seed=sampling_metadata.seed[0, 0])
+next_tokens = paddle.argmax(probs, axis=-1)
```
Copilot AI commented on Dec 15, 2025:
Replacing the sampling method with paddle.argmax effectively changes the sampling strategy from stochastic (top-k/top-p sampling) to deterministic (greedy decoding). This is a significant behavioral change for the MTPSampler's forward_cuda method that could break existing functionality. If deterministic sampling is intended for this specific case, it should be documented and validated; otherwise, the top-k/top-p sampling logic should be retained with proper seed handling similar to the SpeculativeSampler implementation.
Suggested change (replacing the argmax line):

```python
_, next_tokens = top_k_top_p_sampling(
    probs, sampling_metadata.top_p, sampling_metadata.top_k, sampling_metadata.top_k_list
)
```
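To make the behavioral difference concrete, here is a tiny standalone example (toy probabilities, not FastDeploy code) contrasting greedy argmax with a seeded stochastic draw:

```python
import paddle

probs = paddle.to_tensor([[0.05, 0.55, 0.40]])  # toy distribution over 3 tokens

# Greedy decoding: always returns index 1, the highest-probability token.
greedy_token = paddle.argmax(probs, axis=-1)

# Stochastic sampling: index 2 is drawn roughly 40% of the time; seeding the
# global generator makes the draw reproducible without changing its distribution.
paddle.seed(42)
sampled_token = paddle.multinomial(probs, num_samples=1)

print(greedy_token, sampled_token)
```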
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             develop    #5568   +/-   ##
==========================================
  Coverage           ?   62.15%
==========================================
  Files              ?      329
  Lines              ?    41562
  Branches           ?     6345
==========================================
  Hits               ?    25834
  Misses             ?    13791
  Partials           ?     1937
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
```diff
     share_inputs["seq_lens_encoder"],
 )
-_, next_tokens = top_k_top_p_sampling(probs, top_p=top_p, top_k=top_k, seed=sampling_metadata.seed[0, 0])
+next_tokens = paddle.argmax(probs, axis=-1)
```
Does this mean we are simply dropping top_p_top_k_sampling for MTP here?
This is fine in all current scenarios; we are landing it this way for now to improve performance, and this part will be refactored in the next few days.
(#5597)
* fix mtp entropy drop in RL
* optimize usage and fix unit test
* optimize padding_sampling_params speed (vectorized)
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Add at least one tag in the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- If submitting to the release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.