Description
LocalAI version:
Commit: 47b2a50
Environment, CPU architecture, OS, and Version:
- Linux sihil 6.14.4-zen1-1-zen ZEN SMP PREEMPT_DYNAMIC Sat, 26 Apr 2025 00:06:55 +0000 x86_64 GNU/Linux
- Running in Docker with the NVIDIA container runtime
- CUDA version: 12.8
- llama.cpp CUDA backend installed from the web UI backends interface
Describe the bug
I want to set a custom number of experts on a Qwen3 MoE model (Qwen3 30B A3B) using the llama.cpp backend with the kv-overrides option.
I tried using the feature implemented in #5745 as follows:
name: qwen3-test
parameters:
  model: qwen3-30b-a3b-fast.gguf
backend: cuda12-llama-cpp
temperature: 0.7
top_k: 20
top_p: 0.8
min_p: 0
repeat_penalty: 1.05
overrides:
  # any number of experts produces a crash
  - "qwen3moe.expert_used_count=int:8"
context_size: 8192

I used the same number of experts as the model's default for debugging.
Loading the model crashes with a stacktrace no matter how many experts I set.
The relevant error from the log is common/common.cpp:1140: GGML_ASSERT(params.kv_overrides.back().key[0] == 0 && "KV overrides not terminated with empty key") failed
It refers to this line in llama.cpp.
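For context, llama.cpp requires the kv_overrides array handed to the model loader to end with a sentinel entry whose key is an empty string; the assertion fires when that terminator is missing. Below is a minimal sketch of that contract only, not LocalAI's actual backend code; the struct and enum names (llama_model_kv_override, LLAMA_KV_OVERRIDE_TYPE_INT) are taken from llama.h and may differ between versions.

// Sketch only: how a kv-override list is expected to be terminated before it
// reaches the llama.cpp model loader. Names follow llama.h and are assumptions.
#include <cassert>
#include <cstring>
#include <vector>

#include "llama.h" // llama_model_kv_override, LLAMA_KV_OVERRIDE_TYPE_INT

std::vector<llama_model_kv_override> build_overrides() {
    std::vector<llama_model_kv_override> overrides;

    // The override from the model config: qwen3moe.expert_used_count=int:8
    llama_model_kv_override ov{};
    ov.tag = LLAMA_KV_OVERRIDE_TYPE_INT;
    std::strncpy(ov.key, "qwen3moe.expert_used_count", sizeof(ov.key) - 1);
    ov.val_i64 = 8;
    overrides.push_back(ov);

    // Sentinel entry with an empty key; without it the loader hits
    // GGML_ASSERT(params.kv_overrides.back().key[0] == 0 && "KV overrides not terminated with empty key")
    llama_model_kv_override end{};
    end.key[0] = '\0';
    overrides.push_back(end);

    assert(overrides.back().key[0] == 0); // the condition the failing assertion checks
    return overrides;
}

If that is the cause here, the fix would presumably be for the backend to append such an empty-key entry after translating the overrides list from the model config.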
To Reproduce
- Load the model with any chat completion call
Expected behavior
- The model should load using the custom llama.cpp kv-overrides
Logs
Model loading bug
localai-1 | 2:27PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:0xc000857958} sizeCache:0 unknownFields:[] Model:qwen3-30b-a3b-fast.gguf ContextSize:8192 Seed:541601836 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:15 RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/qwen3-30b-a3b-fast.gguf PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 LoadFormat: DisableLogStatus:false DType: LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:auto NoKVOffload:false ModelPath:/models LoraAdapters:[] LoraScales:[] Options:[gpu] CacheTypeKey: CacheTypeValue: GrammarTriggers:[] Reranking:false Overrides:[qwen3moe.expert_used_count=int:8]}
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr build: 6673 (d64c8104) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr system info: n_threads = 15, n_threads_batch = -1, total_threads = 32
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr ggml_cuda_init: found 2 CUDA devices:
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /LocalAI/backend/cpp/llama-cpp-avx2-build/llama.cpp/common/common.cpp:1140: GGML_ASSERT(params.kv_overrides.back().key[0] == 0 && "KV overrides not terminated with empty key") failed
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr system_info: n_threads = 15 / 32 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr srv load_model: loading model '/models/qwen3-30b-a3b-fast.gguf'
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0xe4518b)[0x756ac0c4518b]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0xe4574f)[0x756ac0c4574f]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0xe4591e)[0x756ac0c4591e]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x287132)[0x756ac0087132]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x28c16c)[0x756ac008c16c]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x1964db)[0x756abff964db]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x1e6310)[0x756abffe6310]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x48165e)[0x756ac028165e]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x49fc28)[0x756ac029fc28]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x4a7049)[0x756ac02a7049]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x4e3d36)[0x756ac02e3d36]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x4e3ae3)[0x756ac02e3ae3]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x4e4521)[0x756ac02e4521]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x502da7)[0x756ac0302da7]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x5021b3)[0x756ac03021b3]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x501fee)[0x756ac0301fee]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x502012)[0x756ac0302012]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x17e65ee)[0x756ac15e65ee]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x17e663b)[0x756ac15e663b]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/lib/libc.so.6(+0x94ac3)[0x756ab4494ac3]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/lib/libc.so.6(clone+0x44)[0x756ab4525a74]
localai-1 | 2:27PM ERR Failed to load model qwen3-test with backend localai@llama-cpp error="failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF" modelID=qwen3-test
localai-1 | 2:27PM INF [localai@llama-cpp] Fails: failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
localai-1 | 2:27PM ERR Server error error="could not load model - all backends returned error: [llama-cpp]: failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF\n[rerankers]: failed to load model with internal loader: could not load model (no success): Unexpected err=OSError(\"qwen3-30b-a3b-fast.gguf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\\nIf this is a private repository, make sure to pass a token having permission to this repo either by logging in with `hf auth login` or by passing `token=<your_token>`\"), type(err)=<class 'OSError'>\n[cuda12-llama-cpp]: failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF\n[cuda12-rerankers]: failed to load model with internal loader: could not load model (no success): Unexpected err=OSError(\"qwen3-30b-a3b-fast.gguf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\\nIf this is a private repository, make sure to pass a token having permission to this repo either by logging in with `hf auth login` or by passing `token=<your_token>`\"), type(err)=<class 'OSError'>\n[localai@llama-cpp]: failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF" ip=10.40.1.3 latency=25.018143589s method=POST status=500 url=/v1/chat/completions
Relevant error: common/common.cpp:1140: GGML_ASSERT(params.kv_overrides.back().key[0] == 0 && "KV overrides not terminated with empty key") failed