Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
version: 4906 (60c9029)
built with cc (GCC) 14.2.1 20250207 for x86_64-pc-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server
--port 9002
--metrics
--slots
-m /models/Qwen_QwQ-32B-IQ4_XS.gguf
-ngl 999
--ctx-size 32768
--no-context-shift
-fa
-ctv q8_0
-ctk q8_0
-md /models/Qwen2.5-0.5B-Instruct-IQ4_XS.gguf
-ngld 99
--draft-p-min 0.5
--draft-min 0
--draft-max 15
Problem description & steps to reproduce
When I run llama-server, the KV cache for the main model is initialized first in the log output, as expected:
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.58 MiB
init: kv_size = 32768, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 64, can_shift = 1
init: ROCm0 KV buffer size = 4352.00 MiB
llama_context: KV self size = 4352.00 MiB, K (q8_0): 2176.00 MiB, V (q8_0): 2176.00 MiB
llama_context: ROCm0 compute buffer size = 325.08 MiB
llama_context: ROCm_Host compute buffer size = 74.01 MiB
llama_context: graph nodes = 1991
llama_context: graph splits = 2
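As a sanity check on these numbers, a quick back-of-the-envelope calculation (my own arithmetic, not llama.cpp code) reproduces the logged 4352 MiB exactly, assuming QwQ-32B uses GQA with 8 KV heads of head dimension 128 (consistent with the 64 layers in the log) and that q8_0 stores 32 values per 34-byte block:

```python
# Sketch: reproduce the logged main-model KV size.
# Assumptions: QwQ-32B has 64 layers with 8 KV heads of head dim 128;
# q8_0 packs 32 elements into 34 bytes (~1.0625 bytes/element).
MiB = 1024 * 1024
bytes_q8_0 = 34 / 32
n_ctx, n_layer, n_kv_heads, head_dim = 32768, 64, 8, 128

k_bytes = n_ctx * n_kv_heads * head_dim * bytes_q8_0 * n_layer  # K cache only
print(f"K: {k_bytes / MiB:.2f} MiB, K+V: {2 * k_bytes / MiB:.2f} MiB")
# -> K: 2176.00 MiB, K+V: 4352.00 MiB, matching the log above
```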
However, the KV cache for the draft model is then initialized, and the output appears to be duplicated:
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: ROCm_Host output buffer size = 0.58 MiB
init: kv_size = 32768, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 24, can_shift = 1
init: ROCm0 KV buffer size = 204.00 MiB
llama_context: KV self size = 204.00 MiB, K (q8_0): 102.00 MiB, V (q8_0): 102.00 MiB
llama_context: ROCm0 compute buffer size = 300.26 MiB
llama_context: ROCm_Host compute buffer size = 65.76 MiB
llama_context: graph nodes = 751
llama_context: graph splits = 50
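The same arithmetic reproduces the draft model's 204 MiB, assuming Qwen2.5-0.5B has 24 layers with 2 KV heads of head dimension 64 (again an assumption, but consistent with the logged sizes):

```python
# Sketch: draft-model KV size at q8_0 under the same assumptions
# (Qwen2.5-0.5B: 24 layers, 2 KV heads of head dim 64).
MiB = 1024 * 1024
bytes_q8_0 = 34 / 32
n_ctx, n_layer, n_kv_heads, head_dim = 32768, 24, 2, 64

k_bytes = n_ctx * n_kv_heads * head_dim * bytes_q8_0 * n_layer
print(f"K+V: {2 * k_bytes / MiB:.2f} MiB")  # -> 204.00 MiB, as logged
```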
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch = 32768
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: ROCm_Host output buffer size = 0.58 MiB
init: kv_size = 32768, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 24, can_shift = 1
init: ROCm0 KV buffer size = 384.00 MiB
llama_context: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_context: ROCm0 compute buffer size = 300.25 MiB
llama_context: ROCm_Host compute buffer size = 65.76 MiB
llama_context: graph nodes = 751
llama_context: graph splits = 2
Interestingly, the first block shows the draft cache being initialized with q8_0 quantization, matching what I set for the main model, while the second block shows an FP16 KV cache. My questions are:
- Is the output correct, and is it really initializing the KV cache for the draft model twice? Or is this a display error?
- Is it using KV quantization for the draft model or not? The log output contradicts itself (see the size check below)
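For what it's worth, a quick ratio check (my own arithmetic, not based on the llama.cpp source) shows the two logged draft-cache sizes differ by exactly the f16-to-q8_0 bytes-per-element ratio, so both blocks plausibly describe the same cache geometry, created once with the -ctk/-ctv types and once with the f16 default:

```python
# Ratio check: 384 MiB (f16 init) vs 204 MiB (q8_0 init) for the draft cache.
print(384 / 204)      # 1.8823...
print(2 / (34 / 32))  # 1.8823... -- f16 bytes/elem over q8_0 bytes/elem
# The exact match suggests the same cache, just with different KV types.
```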
I am using QwQ-32B as the main model and Qwen2.5-0.5B-Instruct with QwQ's tokenizer (so it is accepted as a draft model). I am seeing a good ~1.5-2x speedup, so things seem to be working fine. But if the draft model's KV cache really is initialized twice, it might be wasting VRAM for no reason.
First Bad Commit
No response
Relevant log output