
server: improve speed of speculative decoding#17808

Merged
ngxson merged 9 commits into ggml-org:master from ngxson:xsn/server_improve_spec
Dec 8, 2025
Conversation


@ngxson ngxson commented Dec 5, 2025

Fix #12968

I'm testing with:

So far the results are coherent.

How it works:

(attached image: diagram illustrating the approach)
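The general technique behind this PR can be sketched as follows. This is a minimal Python illustration of greedy-verification speculative decoding in general, not the PR's actual C++ implementation; the function names and the greedy acceptance rule are assumptions for the sketch.

```python
# Illustrative sketch of speculative decoding with greedy verification.
# `target_predict_fn` stands in for a single batched evaluation by the
# target model; `draft_tokens` come from a smaller, faster draft model.

def speculative_step(target_predict_fn, draft_tokens, context):
    """Accept the longest prefix of the draft that the target model agrees
    with; on the first disagreement, take the target's token instead."""
    # One batched call: for each draft position (plus one bonus position),
    # the token the target model would have chosen greedily.
    predicted = target_predict_fn(context, draft_tokens)

    accepted = []
    for i, drafted in enumerate(draft_tokens):
        if predicted[i] == drafted:
            accepted.append(drafted)       # target agrees: keep draft token
        else:
            accepted.append(predicted[i])  # first mismatch: take the
            break                          # target's token and stop
    else:
        # Entire draft accepted: the batched eval also yields one bonus
        # token from the target model for free.
        accepted.append(predicted[len(draft_tokens)])
    return accepted
```

The payoff is that one batched target-model evaluation can yield several accepted tokens, while the output distribution stays that of the target model.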

@ngxson ngxson marked this pull request as ready for review December 6, 2025 14:53
@ngxson ngxson requested a review from ggerganov as a code owner December 6, 2025 14:53

ngxson commented Dec 6, 2025

server tests passed locally, this should be ready for review @ggerganov

@theo77186
Contributor

Just a cosmetic bug: in the llama-server logs, the eval time is 0.00 ms, so the total time accounts only for prompt processing. This also makes the reported eval tokens per second meaningless. The model outputs seem to be correct, though.


ngxson commented Dec 8, 2025

@theo77186 you mean just the stdout/stderr log, right? (which is not the stats returned by API)

Edit: I think I need more details on the bug, as well as step-by-step reproduction. Feel free to open a dedicated issue.

@ggerganov
Member

> @theo77186 you mean just the stdout/stderr log, right? (which is not the stats returned by API)

Both of them (logs and UI) were broken in cases when the draft batch was always accepted (e.g. "count from 1 to 100"). I fixed it with f74d1ee.

@ngxson ngxson merged commit f896d2c into ggml-org:master Dec 8, 2025
68 of 69 checks passed
- Separation of HTTP logic into dedicated files: https://github.com/ggml-org/llama.cpp/pull/17216
- Large-scale code base split into smaller files: https://github.com/ggml-org/llama.cpp/pull/17362
- Introduction of router mode: https://github.com/ggml-org/llama.cpp/pull/17470
- Speculative decoding: https://github.com/ggml-org/llama.cpp/pull/17808 and rework in https://github.com/ggml-org/llama.cpp/pull/17808
Contributor

@Nindaleth Nindaleth Dec 11, 2025


both PR numbers are the same
EDIT: the first should have been #10455 I guess?

0Marble pushed a commit to 0Marble/llama.cpp that referenced this pull request Dec 18, 2025
* server: improve speed of speculative decoding

* fix small draft case

* add link to the PR

* server : fix generation time measurement

* server : fix draft acceptance logs (add SRV_CNT, SLT_CNT macros)

* server : add comment

* add PR to docs

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026

Successfully merging this pull request may close these issues.

Misc. bug: llama-server speculative decoding not as performant as llama-speculative-simple
