[Speculative Decoding] Return accepted tokens per head in response #5947
Motivation
Modifications
Building on PR #5518, this PR adds the number of accepted tokens per head to the response. An example response is shown below.
{
  "id": "xxx",
  "model": "xxx",
  "choices": [
    {
      ...
      "speculate_metrics": {
        "accepted_tokens": 128,
        "rejected_tokens": 112,
        "accept_ratio": 0.53125,
        "average_accept_length": 2.1333333333333333,
        "accepted_tokens_per_head": [60, 36, 23, 9],
        "accept_ratio_per_head": [0.6, 0.6388888888888888, 0.391304347826087]
      }
    }
  ],
  ...
}

The acceptance rate for each head is calculated as follows.
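The formula itself did not survive in this description, so the following is only a sketch inferred from the example values above, not the PR's actual implementation. The numbers are consistent with each entry of accept_ratio_per_head being the accepted-token count of one head divided by the count of the head before it (36/60, 23/36, 9/23). The function name `accept_ratio_per_head` is a hypothetical helper introduced for illustration.

```python
from typing import List


def accept_ratio_per_head(accepted_tokens_per_head: List[int]) -> List[float]:
    """Hypothetical helper: ratio of each head's accepted count to the previous head's.

    Assumption inferred from the example response: entry i is
    accepted_tokens_per_head[i + 1] / accepted_tokens_per_head[i],
    so the result has one fewer element than the input.
    """
    ratios = []
    for prev, curr in zip(accepted_tokens_per_head, accepted_tokens_per_head[1:]):
        # Guard against division by zero when a head accepted nothing.
        ratios.append(curr / prev if prev else 0.0)
    return ratios


# Values taken from the example response above.
counts = [60, 36, 23, 9]
ratios = accept_ratio_per_head(counts)
print(ratios)
```

With the example counts, the three ratios match the accept_ratio_per_head values in the response, and the counts sum to the reported accepted_tokens of 128.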
Usage or Command
Accuracy Tests
Checklist
Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
Run pre-commit before commit.
If the PR targets the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.