[Cherry-Pick][CI]Support different inferseed in speculate decoding(#5568) #5597
Conversation
Thanks for your contribution!
Pull request overview
This PR is a cherry-pick from #5568 that addresses a seed diversity issue in speculative decoding. The changes modify the seed handling from a single shared integer to per-token seed tensors, resolving a ~10% diversity reduction that occurred when draft tokens shared the same seed value.
Key Changes:
- Modified seed representation from scalar to tensor to provide unique seeds per token in a batch (see the toy sketch below)
- Updated `padding_sampling_params` to handle seed offset calculation for decoder sequences
- Adjusted seed increment values in the GPU model runner to account for speculative decoding token counts
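A toy illustration of the diversity issue the overview describes; this is not FastDeploy code (`paddle.multinomial` stands in here for the project's `top_k_top_p_sampling` op, and the step of 4 mirrors the per-token offsets introduced by this PR):

```python
import paddle

# Three draft-token positions sampling from the same toy distribution.
probs = paddle.to_tensor([0.5, 0.3, 0.2])

# Shared scalar seed: every position restarts the RNG from the same state,
# so all three positions draw the same token.
shared = []
for _ in range(3):
    paddle.seed(42)
    shared.append(int(paddle.multinomial(probs, num_samples=1)))

# Per-token seeds (42, 46, 50): each position starts from a different RNG
# state, so the draws are decorrelated.
per_token = []
for k in range(3):
    paddle.seed(42 + 4 * k)
    per_token.append(int(paddle.multinomial(probs, num_samples=1)))

print(shared)     # always three identical tokens
print(per_token)  # tokens are free to differ
```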
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/layers/test_speculative_sampler.py | Updated test seed initialization and added new unit tests for the padding functionality with seed offsets |
| fastdeploy/worker/gpu_model_runner.py | Modified seed increment calculation to account for speculative token count in speculative decoding mode |
| fastdeploy/model_executor/layers/sample/sampler.py | Extended padding_sampling_params to handle per-token seed generation with offsets, updated sampling calls to use tensor seeds, and changed MTPSampler to use argmax |
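The `gpu_model_runner.py` row above describes adjusting the seed increment for speculative decoding. A hedged sketch of what that bookkeeping could look like, where the identifiers `next_infer_seed` and `num_speculative_tokens` are illustrative rather than the file's actual names, and the real calculation may differ:

```python
# Each step consumes one seed per sampled position (target plus draft
# tokens), spaced SEED_OFFSET_STEP apart, so the base seed must advance
# past the whole stride before the next step.
SEED_OFFSET_STEP = 4
MAX_INFER_SEED = 9223372036854775806  # 2**63 - 2, as in sampler.py

def next_infer_seed(infer_seed: int, num_speculative_tokens: int) -> int:
    increment = SEED_OFFSET_STEP * (1 + num_speculative_tokens)
    return (infer_seed + increment) % MAX_INFER_SEED
```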
Comments suppressed due to low confidence (1)
tests/layers/test_speculative_sampler.py:282
- The newly added test functions `test_padding_sampling_params_basic` and `test_padding_sampling_params_seed_offset` are not being called in the `__main__` block. These tests should be added to ensure the new padding functionality is properly tested when the test file is run directly.
```python
if __name__ == "__main__":
    test_speculative_sampler()
    test_speculative_sampler_logprobs()
    test_mtp_sampler()
    test_mtp_sampler_logprobs()
```
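The fix the comment asks for is straightforward (a sketch; the two added calls are exactly the functions named above):

```python
if __name__ == "__main__":
    test_speculative_sampler()
    test_speculative_sampler_logprobs()
    test_mtp_sampler()
    test_mtp_sampler_logprobs()
    # Run the newly added padding tests under direct invocation as well:
    test_padding_sampling_params_basic()
    test_padding_sampling_params_seed_offset()
```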
The hunk under review (from `fastdeploy/model_executor/layers/sample/sampler.py`):

```python
idx = 0
for i in range(real_bsz):
    if seq_lens_encoder[i] == 0:
        seq_len_this_time = seq_lens_this_time[i]
        offsets = 4 * paddle.arange(seq_len_this_time, dtype=topp_seed.dtype)
        topp_seed[idx : idx + seq_len_this_time, 0] = (
            topp_seed[idx : idx + seq_len_this_time, 0] + offsets
        ) % MAX_INFER_SEED
        idx += seq_len_this_time
    else:
        idx += 1
```
Copilot AI · Dec 16, 2025
The seed offset calculation uses a Python for loop which may be inefficient for large batch sizes. Consider vectorizing this operation using paddle operations to avoid the Python loop overhead. For example, you could construct a mask tensor for decoder sequences and apply offsets in a vectorized manner.
Suggested change (replacing the loop shown above):

```python
# Vectorized implementation for seed offset calculation
# Create a mask for decoder sequences (where seq_lens_encoder == 0)
mask = seq_lens_encoder[:real_bsz] == 0
# Get the indices of decoder sequences
decoder_indices = paddle.nonzero(mask).flatten()
# For each decoder sequence, calculate the start and end indices in topp_seed
seq_lens = seq_lens_this_time[decoder_indices]
# Compute the start indices for each decoder sequence in topp_seed.
# The repeats array tells us how many rows each batch element occupies in topp_seed.
repeats_np = repeats.numpy()
start_indices = []
idx = 0
for r in repeats_np:
    start_indices.append(idx)
    idx += int(r)
start_indices = paddle.to_tensor(start_indices, dtype="int64")
decoder_start_indices = paddle.gather(start_indices, decoder_indices)
# For each decoder sequence, generate the offsets and apply them
for i in range(decoder_indices.shape[0]):
    start = int(decoder_start_indices[i].item())
    length = int(seq_lens[i].item())
    offsets = 4 * paddle.arange(length, dtype=topp_seed.dtype)
    topp_seed[start : start + length, 0] = (
        topp_seed[start : start + length, 0] + offsets
    ) % MAX_INFER_SEED
```
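Note that the suggestion above still loops in Python, once over `repeats` and once over the decoder sequences. A loop-free variant is possible; here is a hedged sketch, assuming `repeats` holds the per-batch row counts used to build `topp_seed` and that `seq_lens_encoder` flattens to one entry per batch element (the actual shapes in sampler.py may differ):

```python
import paddle

repeats64 = repeats.astype("int64")
# Batch index of every row in topp_seed.
batch_ids = paddle.repeat_interleave(paddle.arange(real_bsz, dtype="int64"), repeats64)
# First row of each batch element's segment (exclusive prefix sum of repeats).
starts = paddle.concat([paddle.zeros([1], dtype="int64"), paddle.cumsum(repeats64)[:-1]])
# Position of each row within its segment: 0, 1, 2, ...
within = paddle.arange(batch_ids.shape[0], dtype=topp_seed.dtype) - paddle.gather(
    starts, batch_ids
).astype(topp_seed.dtype)
# Apply the 4*k offset only to decoder rows (seq_lens_encoder == 0).
is_decoder = (
    paddle.gather(seq_lens_encoder.flatten()[:real_bsz], batch_ids) == 0
).astype(topp_seed.dtype)
topp_seed[:, 0] = (topp_seed[:, 0] + 4 * within * is_decoder) % MAX_INFER_SEED
```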
```diff
-def padding_sampling_params(top_p, top_k, seq_lens_this_time, seq_lens_encoder):
+def padding_sampling_params(top_p, top_k, infer_seed, seq_lens_this_time, seq_lens_encoder):
```
Copilot AI · Dec 16, 2025
The function `padding_sampling_params` lacks documentation explaining its purpose, parameters, and return values. This is especially important given the complex logic for seed offset calculation. Consider adding a docstring that explains: the purpose of the function in the context of speculative decoding, what each parameter represents, the seed offset calculation logic (e.g., why decoder sequences get `4*k` offsets), and what the three return values represent.
Suggested change:

```python
def padding_sampling_params(top_p, top_k, infer_seed, seq_lens_this_time, seq_lens_encoder):
    """
    Prepare and pad sampling parameters for each token position in a batch,
    supporting speculative decoding.

    This function expands the per-sequence sampling parameters (top_p, top_k,
    infer_seed) to per-token parameters, taking into account whether each
    sequence is a decoder sequence (seq_lens_encoder == 0) or not. For decoder
    sequences, the random seed for each token position is offset by 4*k, where
    k is the token index, to ensure different random seeds for each token
    during speculative decoding. This helps avoid repeated sampling results
    when generating multiple tokens in a single forward pass.

    Args:
        top_p (paddle.Tensor): Shape [batch_size]; top-p sampling value for each sequence.
        top_k (paddle.Tensor): Shape [batch_size]; top-k sampling value for each sequence.
        infer_seed (paddle.Tensor): Shape [batch_size]; random seed for each sequence.
        seq_lens_this_time (paddle.Tensor): Shape [batch_size]; number of tokens to
            generate for each sequence in this step.
        seq_lens_encoder (paddle.Tensor): Shape [batch_size]; encoder sequence length
            for each sequence (0 for decoder-only).

    Returns:
        top_p_padding (paddle.Tensor): Padded top-p values, shape
            [sum(seq_lens_this_time), 1], one per token position.
        top_k_padding (paddle.Tensor): Padded top-k values, same shape, one per
            token position.
        topp_seed (paddle.Tensor): Padded and offset random seeds, same shape,
            one per token position.
    """
```
```python
if seq_lens_encoder[i] == 0:
    seq_len_this_time = seq_lens_this_time[i]
    offsets = 4 * paddle.arange(seq_len_this_time, dtype=topp_seed.dtype)
```
Copilot AI · Dec 16, 2025
The offset calculation uses a hardcoded multiplier of 4, which should be documented or extracted as a named constant. This magic number appears related to the seed increment logic, but the relationship is not clear from the code. Consider defining it as a constant with a descriptive name (e.g., `SEED_OFFSET_STEP`) to improve code clarity and maintainability.
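Such a refactor might look like this (the name `SEED_OFFSET_STEP` is the reviewer's example, not an identifier from the codebase):

```python
# Stride between consecutive per-token seeds; must stay in sync with the
# per-step seed increment applied in gpu_model_runner.py.
SEED_OFFSET_STEP = 4

offsets = SEED_OFFSET_STEP * paddle.arange(seq_len_this_time, dtype=topp_seed.dtype)
```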
```diff
     share_inputs["seq_lens_encoder"],
 )
-_, next_tokens = top_k_top_p_sampling(probs, top_p=top_p, top_k=top_k, seed=sampling_metadata.seed[0, 0])
+next_tokens = paddle.argmax(probs, axis=-1)
```
Copilot AI · Dec 16, 2025
The `MTPSampler.forward_cuda` method has been changed from sampling (`top_k_top_p_sampling`) to deterministic selection (`argmax`). This is a significant behavior change that replaces probabilistic sampling with greedy decoding. It appears unrelated to the seed diversity improvements and may negatively impact model output quality and diversity. If this change is intentional, it should be documented and explained in the PR description.
Suggested change:

```diff
-next_tokens = paddle.argmax(probs, axis=-1)
+_, next_tokens = top_k_top_p_sampling(
+    probs, sampling_metadata.top_p, sampling_metadata.top_k, sampling_metadata.top_k_list
+)
```
```python
return top_p_padding, top_k_padding
topp_seed = paddle.repeat_interleave(infer_seed[:real_bsz], repeats).unsqueeze(1)
MAX_INFER_SEED = 9223372036854775806
```
Copilot AI · Dec 16, 2025
The magic number 9223372036854775806 (which is 2^63 - 2) should be defined as a named constant with an explanatory comment, ideally at the module level or as a class constant. This would improve code readability and maintainability. Consider adding a comment explaining why this specific value is chosen (e.g., maximum value for int64 with safety margin).
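Something along these lines would address the comment (a sketch; the module-level placement and the rationale in the comment follow the reviewer's suggestion):

```python
# Maximum inference seed: 2**63 - 2, one below the int64 maximum, leaving a
# safety margin so seed arithmetic stays within int64 before the modulo.
MAX_INFER_SEED = 9223372036854775806
```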
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##           release/2.4   #5597   +/-  ##
==============================================
  Coverage             ?   59.00%
==============================================
  Files                ?      327
  Lines                ?    40672
  Branches             ?     6177
==============================================
  Hits                 ?    23999
  Misses               ?    14811
  Partials             ?     1862
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- PR title tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- Run `pre-commit` before commit.
- For a PR to a `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.