Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
version: 4906 (60c9029)
built with cc (GCC) 14.2.1 20250207 for x86_64-pc-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server
--port 9002
--metrics
--slots
-m /models/Qwen_QwQ-32B-IQ4_XS.gguf
-ngl 999
--ctx-size 32768
--no-context-shift
-fa
-ctv q8_0
-ctk q8_0
-md /models/Qwen2.5-0.5B-Instruct-IQ4_XS.gguf
-ngld 99
--draft-p-min 0.5
--draft-min 0
--draft-max 15
Problem description & steps to reproduce
When I run llama-server, the KV cache for the main model is initialized first in the log output, as expected:
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.58 MiB
init: kv_size = 32768, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 64, can_shift = 1
init: ROCm0 KV buffer size = 4352.00 MiB
llama_context: KV self size = 4352.00 MiB, K (q8_0): 2176.00 MiB, V (q8_0): 2176.00 MiB
llama_context: ROCm0 compute buffer size = 325.08 MiB
llama_context: ROCm_Host compute buffer size = 74.01 MiB
llama_context: graph nodes = 1991
llama_context: graph splits = 2
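As a sanity check on these numbers, a quick back-of-the-envelope calculation (my own arithmetic, not llama.cpp code) reproduces the logged 4352 MiB exactly, assuming QwQ-32B uses GQA with 8 KV heads of head dimension 128 (consistent with the 64 layers in the log) and that q8_0 stores 32 values per 34-byte block:

```python
# Sketch: reproduce the logged main-model KV size.
# Assumptions: QwQ-32B has 64 layers with 8 KV heads of head dim 128;
# q8_0 packs 32 elements into 34 bytes (~1.0625 bytes/element).
MiB = 1024 * 1024
bytes_q8_0 = 34 / 32
n_ctx, n_layer, n_kv_heads, head_dim = 32768, 64, 8, 128

k_bytes = n_ctx * n_kv_heads * head_dim * bytes_q8_0 * n_layer  # K cache only
print(f"K: {k_bytes / MiB:.2f} MiB, K+V: {2 * k_bytes / MiB:.2f} MiB")
# -> K: 2176.00 MiB, K+V: 4352.00 MiB, matching the log above
```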
However, the KV cache for the draft model is then initialized, and the output appears to be duplicated:
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: ROCm_Host output buffer size = 0.58 MiB
init: kv_size = 32768, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 24, can_shift = 1
init: ROCm0 KV buffer size = 204.00 MiB
llama_context: KV self size = 204.00 MiB, K (q8_0): 102.00 MiB, V (q8_0): 102.00 MiB
llama_context: ROCm0 compute buffer size = 300.26 MiB
llama_context: ROCm_Host compute buffer size = 65.76 MiB
llama_context: graph nodes = 751
llama_context: graph splits = 50
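The same arithmetic reproduces the draft model's 204 MiB, assuming Qwen2.5-0.5B has 24 layers with 2 KV heads of head dimension 64 (again an assumption, but consistent with the logged sizes):

```python
# Sketch: draft-model KV size at q8_0 under the same assumptions
# (Qwen2.5-0.5B: 24 layers, 2 KV heads of head dim 64).
MiB = 1024 * 1024
bytes_q8_0 = 34 / 32
n_ctx, n_layer, n_kv_heads, head_dim = 32768, 24, 2, 64

k_bytes = n_ctx * n_kv_heads * head_dim * bytes_q8_0 * n_layer
print(f"K+V: {2 * k_bytes / MiB:.2f} MiB")  # -> 204.00 MiB, as logged
```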
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch = 32768
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: ROCm_Host output buffer size = 0.58 MiB
init: kv_size = 32768, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 24, can_shift = 1
init: ROCm0 KV buffer size = 384.00 MiB
llama_context: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_context: ROCm0 compute buffer size = 300.25 MiB
llama_context: ROCm_Host compute buffer size = 65.76 MiB
llama_context: graph nodes = 751
llama_context: graph splits = 2
Interestingly, the first block shows the draft cache being initialized with q8_0 quantization, matching what I set for the main model, while the second block shows an FP16 KV cache. My questions are:
- Is the output correct, and is it really initializing the KV cache for the draft model twice? Or is this a display error?
- Is it using KV quantization for the draft model or not? The log output contradicts itself (see the size check below)
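For what it's worth, a quick ratio check (my own arithmetic, not based on the llama.cpp source) shows the two logged draft-cache sizes differ by exactly the f16-to-q8_0 bytes-per-element ratio, so both blocks plausibly describe the same cache geometry, created once with the -ctk/-ctv types and once with the f16 default:

```python
# Ratio check: 384 MiB (f16 init) vs 204 MiB (q8_0 init) for the draft cache.
print(384 / 204)      # 1.8823...
print(2 / (34 / 32))  # 1.8823... -- f16 bytes/elem over q8_0 bytes/elem
# The exact match suggests the same cache, just with different KV types.
```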
I am using QwQ-32B as the main model and Qwen2.5-0.5B-Instruct with QwQ's tokenizer (so it is accepted as a draft model). I am seeing a good ~1.5-2x speedup, so things seem to be working fine. But if the draft model's KV cache really is initialized twice, it might be wasting VRAM for no reason.
First Bad Commit
No response
Relevant log output