
Model loading crash using kv-overrides (#5745) with llama.cpp CUDA backend #6643

@blob42

Description


LocalAI version:

Commit: 47b2a50

Environment, CPU architecture, OS, and Version:

  • Linux sihil 6.14.4-zen1-1-zen ZEN SMP PREEMPT_DYNAMIC Sat, 26 Apr 2025 00:06:55 +0000 x86_64 GNU/Linux
  • Running in Docker with the NVIDIA container runtime
  • CUDA version: 12.8
  • llama.cpp CUDA backend installed from the web UI backends interface

Describe the bug

I want to set a custom number of experts on a Qwen3 MoE model (Qwen3 30B A3B) using the llama.cpp backend with the kv-override option.

I tried using the feature implemented in #5745 as follows:

name: qwen3-test
parameters:
  model: qwen3-30b-a3b-fast.gguf
  backend: cuda12-llama-cpp
  temperature: 0.7
  top_k: 20
  top_p: 0.8
  min_p: 0
  repeat_penalty: 1.05

overrides:
  # any number of experts produces a crash
  - "qwen3moe.expert_used_count=int:8"

context_size: 8192
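
For reference, the override string appears to use the same KEY=TYPE:VALUE format as llama.cpp's --override-kv flag, so the equivalent standalone invocation would look something like this (binary name and model path are illustrative):

llama-cli -m /models/qwen3-30b-a3b-fast.gguf \
  --override-kv qwen3moe.expert_used_count=int:8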

For debugging, I used the same number of experts as the model's default.
Loading the model crashes with a stack trace no matter how many experts I set.

The relevant error from the log is common/common.cpp:1140: GGML_ASSERT(params.kv_overrides.back().key[0] == 0 && "KV overrides not terminated with empty key") failed

This refers to the assertion in llama.cpp's common/common.cpp (line 1140 in the build above).
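
For context, that assertion requires the caller-built override list to end with an empty-key sentinel before it is handed to the model loader. A minimal C++ sketch of what a well-formed caller does, using the struct and enum names from llama.h (the wrapping function is illustrative, not LocalAI's actual backend code):

#include <cstring>
#include <vector>
#include "llama.h"

// Build the kv_overrides vector the way common.cpp expects it:
// real entries first, then a zero-filled sentinel whose key is empty.
static void add_expert_override(std::vector<llama_model_kv_override> & overrides) {
    llama_model_kv_override kvo;
    std::memset(&kvo, 0, sizeof(kvo));
    std::strncpy(kvo.key, "qwen3moe.expert_used_count", sizeof(kvo.key) - 1);
    kvo.tag     = LLAMA_KV_OVERRIDE_TYPE_INT;  // matches the "int:" prefix
    kvo.val_i64 = 8;                           // matches the ":8" value
    overrides.push_back(kvo);

    // Terminator entry: key[0] == 0 is exactly what the GGML_ASSERT checks.
    llama_model_kv_override sentinel;
    std::memset(&sentinel, 0, sizeof(sentinel));
    overrides.push_back(sentinel);
}

If the backend fills the overrides vector from the YAML list but never appends that terminating entry, the assertion would fail for every override value, which would match the "any number of experts crashes" behavior above.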

To Reproduce

  • Load the model with any chat completion call

Expected behavior

  • The model should load with the custom llama.cpp kv-overrides applied

Logs

Model loading bug

localai-1  | 2:27PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:0xc000857958} sizeCache:0 unknownFields:[] Model:qwen3-30b-a3b-fast.gguf ContextSize:8192 Seed:541601836 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:15 RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/qwen3-30b-a3b-fast.gguf PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 LoadFormat: DisableLogStatus:false DType: LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:auto NoKVOffload:false ModelPath:/models LoraAdapters:[] LoraScales:[] Options:[gpu] CacheTypeKey: CacheTypeValue: GrammarTriggers:[] Reranking:false Overrides:[qwen3moe.expert_used_count=int:8]}
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr build: 6673 (d64c8104) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr system info: n_threads = 15, n_threads_batch = -1, total_threads = 32
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr ggml_cuda_init: found 2 CUDA devices:
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr   Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /LocalAI/backend/cpp/llama-cpp-avx2-build/llama.cpp/common/common.cpp:1140: GGML_ASSERT(params.kv_overrides.back().key[0] == 0 && "KV overrides not terminated with empty key") failed
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr system_info: n_threads = 15 / 32 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr srv    load_model: loading model '/models/qwen3-30b-a3b-fast.gguf'
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0xe4518b)[0x756ac0c4518b]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0xe4574f)[0x756ac0c4574f]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0xe4591e)[0x756ac0c4591e]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x287132)[0x756ac0087132]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x28c16c)[0x756ac008c16c]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x1964db)[0x756abff964db]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x1e6310)[0x756abffe6310]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x48165e)[0x756ac028165e]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x49fc28)[0x756ac029fc28]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x4a7049)[0x756ac02a7049]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x4e3d36)[0x756ac02e3d36]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x4e3ae3)[0x756ac02e3ae3]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x4e4521)[0x756ac02e4521]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x502da7)[0x756ac0302da7]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x5021b3)[0x756ac03021b3]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x501fee)[0x756ac0301fee]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x502012)[0x756ac0302012]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x17e65ee)[0x756ac15e65ee]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x17e663b)[0x756ac15e663b]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/lib/libc.so.6(+0x94ac3)[0x756ab4494ac3]
localai-1  | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/lib/libc.so.6(clone+0x44)[0x756ab4525a74]
localai-1  | 2:27PM ERR Failed to load model qwen3-test with backend localai@llama-cpp error="failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF" modelID=qwen3-test
localai-1  | 2:27PM INF [localai@llama-cpp] Fails: failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
localai-1  | 2:27PM ERR Server error error="could not load model - all backends returned error: [llama-cpp]: failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF\n[rerankers]: failed to load model with internal loader: could not load model (no success): Unexpected err=OSError(\"qwen3-30b-a3b-fast.gguf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\\nIf this is a private repository, make sure to pass a token having permission to this repo either by logging in with `hf auth login` or by passing `token=<your_token>`\"), type(err)=<class 'OSError'>\n[cuda12-llama-cpp]: failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF\n[cuda12-rerankers]: failed to load model with internal loader: could not load model (no success): Unexpected err=OSError(\"qwen3-30b-a3b-fast.gguf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\\nIf this is a private repository, make sure to pass a token having permission to this repo either by logging in with `hf auth login` or by passing `token=<your_token>`\"), type(err)=<class 'OSError'>\n[localai@llama-cpp]: failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF" ip=10.40.1.3 latency=25.018143589s method=POST status=500 url=/v1/chat/completions

Relevant error: common/common.cpp:1140: GGML_ASSERT(params.kv_overrides.back().key[0] == 0 && "KV overrides not terminated with empty key") failed
