server: improve speed of speculative decoding #17808
Conversation
server tests passed locally, this should be ready for review @ggerganov
Just a cosmetic bug: in the

@theo77186 you mean just the stdout/stderr log, right? (which is not the stats returned by the API) Edit: I think I need more details on the bug, as well as a step-by-step reproduction. Feel free to open a dedicated issue.
Both of them (logs and UI) were broken in cases where the draft batch was always fully accepted (e.g. "count from 1 to 100"). I fixed it in f74d1ee.
- Separation of HTTP logic into dedicated files: https://github.com/ggml-org/llama.cpp/pull/17216
- Large-scale code base split into smaller files: https://github.com/ggml-org/llama.cpp/pull/17362
- Introduction of router mode: https://github.com/ggml-org/llama.cpp/pull/17470
- Speculative decoding: https://github.com/ggml-org/llama.cpp/pull/17808 and rework in https://github.com/ggml-org/llama.cpp/pull/17808
both PR numbers are the same
EDIT: the first should have been #10455 I guess?
* server: improve speed of speculative decoding
* fix small draft case
* add link to the PR
* server : fix generation time measurement
* server : fix draft acceptance logs (add SRV_CNT, SLT_CNT macros)
* server : add comment
* add PR to docs

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Fix #12968
I'm testing with:
So far the results are coherent.
How it works:
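The general draft-then-verify loop behind speculative decoding can be sketched as follows. This is a toy illustration with stand-in "models" (simple deterministic functions), not the llama.cpp implementation: a draft model proposes a batch of tokens, the target model checks them position by position, and the longest agreeing prefix is accepted (plus the target's own token at the first divergence).

```python
# Toy sketch of greedy speculative decoding. `target_next` and `draft_next`
# are hypothetical stand-ins for real model decode steps.

def target_next(ctx):
    # Stand-in for one target-model decode step (here: a fixed arithmetic rule).
    return (sum(ctx) * 31 + 7) % 100

def draft_next(ctx):
    # Stand-in draft model; agrees with the target except on multiples of 5.
    t = target_next(ctx)
    return t if t % 5 != 0 else (t + 1) % 100

def speculative_step(ctx, n_draft=4):
    """Draft n_draft tokens, verify them against the target model, and
    return the accepted tokens for this step."""
    # 1) Draft phase: the cheap model extends the context on its own.
    draft = []
    tmp = list(ctx)
    for _ in range(n_draft):
        tok = draft_next(tmp)
        draft.append(tok)
        tmp.append(tok)

    # 2) Verify phase: the target model checks each drafted position.
    #    (In a real server this is one batched target pass, not a loop.)
    accepted = []
    tmp = list(ctx)
    for tok in draft:
        t = target_next(tmp)
        accepted.append(t)      # the target's token is always kept
        if t != tok:            # draft diverged: stop accepting further drafts
            break
        tmp.append(t)
    return accepted

print(speculative_step([1, 2, 3]))  # → [93, 76, 32, 24] (full batch accepted)
print(speculative_step([3]))        # → [0] (draft rejected at position 0)
```

The first call shows the "always fully accepted" case mentioned above (the one whose stats reporting was broken); the second shows early rejection, where only the target's corrected token survives.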