
server: improve speed of speculative decoding#17808

Merged
ngxson merged 9 commits into ggml-org:master from ngxson:xsn/server_improve_spec
Dec 8, 2025
Conversation


@ngxson ngxson commented Dec 5, 2025

Fix #12968

I'm testing with:

So far the results are coherent.

How it works:

(attached image: diagram illustrating the approach)
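The general technique behind this PR can be sketched as follows. This is a minimal Python illustration of greedy-verification speculative decoding in general, not the PR's actual C++ implementation; the function names and the greedy acceptance rule are assumptions for the sketch.

```python
# Illustrative sketch of speculative decoding with greedy verification.
# `target_predict_fn` stands in for a single batched evaluation by the
# target model; `draft_tokens` come from a smaller, faster draft model.

def speculative_step(target_predict_fn, draft_tokens, context):
    """Accept the longest prefix of the draft that the target model agrees
    with; on the first disagreement, take the target's token instead."""
    # One batched call: for each draft position (plus one bonus position),
    # the token the target model would have chosen greedily.
    predicted = target_predict_fn(context, draft_tokens)

    accepted = []
    for i, drafted in enumerate(draft_tokens):
        if predicted[i] == drafted:
            accepted.append(drafted)       # target agrees: keep draft token
        else:
            accepted.append(predicted[i])  # first mismatch: take the
            break                          # target's token and stop
    else:
        # Entire draft accepted: the batched eval also yields one bonus
        # token from the target model for free.
        accepted.append(predicted[len(draft_tokens)])
    return accepted
```

The payoff is that one batched target-model evaluation can yield several accepted tokens, while the output distribution stays that of the target model.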

@ngxson ngxson marked this pull request as ready for review December 6, 2025 14:53
@ngxson ngxson requested a review from ggerganov as a code owner December 6, 2025 14:53

ngxson commented Dec 6, 2025

server tests passed locally, this should be ready for review @ggerganov

@theo77186
Contributor

Just a cosmetic bug: in the llama-server logs, the eval time is 0.00 ms, so the total time accounts only for prompt processing. This also makes the reported eval tokens per second meaningless. The model outputs seem to be correct, though.


ngxson commented Dec 8, 2025

@theo77186 you mean just the stdout/stderr log, right? (which is not the stats returned by API)

Edit: I think I need more details on the bug, as well as step-by-step reproduction. Feel free to open a dedicated issue.

@ggerganov
Member

> @theo77186 you mean just the stdout/stderr log, right? (which is not the stats returned by API)

Both of them (logs and UI) were broken in cases when the draft batch was always accepted (e.g. "count from 1 to 100"). I fixed it with f74d1ee.

@ngxson ngxson merged commit f896d2c into ggml-org:master Dec 8, 2025
68 of 69 checks passed
- Separation of HTTP logic into dedicated files: https://github.com/ggml-org/llama.cpp/pull/17216
- Large-scale code base split into smaller files: https://github.com/ggml-org/llama.cpp/pull/17362
- Introduction of router mode: https://github.com/ggml-org/llama.cpp/pull/17470
- Speculative decoding: https://github.com/ggml-org/llama.cpp/pull/17808 and rework in https://github.com/ggml-org/llama.cpp/pull/17808
Contributor

@Nindaleth Nindaleth Dec 11, 2025


both PR numbers are the same
EDIT: the first should have been #10455 I guess?

0Marble pushed a commit to 0Marble/llama.cpp that referenced this pull request Dec 18, 2025
* server: improve speed of speculative decoding

* fix small draft case

* add link to the PR

* server : fix generation time measurement

* server : fix draft acceptance logs (add SRV_CNT, SLT_CNT macros)

* server : add comment

* add PR to docs

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026

Successfully merging this pull request may close these issues.

Misc. bug: llama-server speculative decoding not as performant as llama-speculative-simple
