Description
LocalAI version:
Commit: 47b2a50
Environment, CPU architecture, OS, and Version:
- Linux sihil 6.14.4-zen1-1-zen ZEN SMP PREEMPT_DYNAMIC Sat, 26 Apr 2025 00:06:55 +0000 x86_64 GNU/Linux
- Running in Docker with the NVIDIA container runtime
- CUDA version: 12.8
- llama.cpp CUDA backend installed from the web UI backends interface
Describe the bug
I want to set a custom number of experts on a Qwen3 MoE model (Qwen3 30B A3B) using the llama.cpp backend with the kv-overrides option.
I tried using the feature implemented in #5745 as follows:
name: qwen3-test
parameters:
  model: qwen3-30b-a3b-fast.gguf
backend: cuda12-llama-cpp
temperature: 0.7
top_k: 20
top_p: 0.8
min_p: 0
repeat_penalty: 1.05
overrides:
  # any number of experts produces a crash
  - "qwen3moe.expert_used_count=int:8"
context_size: 8192

I used the same number of experts as the model's default for debugging.
Loading the model crashes with a stacktrace no matter how many experts I set.
The relevant error from the log is common/common.cpp:1140: GGML_ASSERT(params.kv_overrides.back().key[0] == 0 && "KV overrides not terminated with empty key") failed
It refers to this line in llama.cpp.
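For context, llama.cpp requires the kv_overrides array handed to the model loader to end with a sentinel entry whose key is an empty string; the assertion fires when that terminator is missing. Below is a minimal sketch of that contract only, not LocalAI's actual backend code; the struct and enum names (llama_model_kv_override, LLAMA_KV_OVERRIDE_TYPE_INT) are taken from llama.h and may differ between versions.

// Sketch only: how a kv-override list is expected to be terminated before it
// reaches the llama.cpp model loader. Names follow llama.h and are assumptions.
#include <cassert>
#include <cstring>
#include <vector>

#include "llama.h" // llama_model_kv_override, LLAMA_KV_OVERRIDE_TYPE_INT

std::vector<llama_model_kv_override> build_overrides() {
    std::vector<llama_model_kv_override> overrides;

    // The override from the model config: qwen3moe.expert_used_count=int:8
    llama_model_kv_override ov{};
    ov.tag = LLAMA_KV_OVERRIDE_TYPE_INT;
    std::strncpy(ov.key, "qwen3moe.expert_used_count", sizeof(ov.key) - 1);
    ov.val_i64 = 8;
    overrides.push_back(ov);

    // Sentinel entry with an empty key; without it the loader hits
    // GGML_ASSERT(params.kv_overrides.back().key[0] == 0 && "KV overrides not terminated with empty key")
    llama_model_kv_override end{};
    end.key[0] = '\0';
    overrides.push_back(end);

    assert(overrides.back().key[0] == 0); // the condition the failing assertion checks
    return overrides;
}

If that is the cause here, the fix would presumably be for the backend to append such an empty-key entry after translating the overrides list from the model config.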
To Reproduce
- Load the model with any chat completion call
Expected behavior
- The model should load using the custom llama.cpp kv-overrides
Logs
Model loading bug
localai-1 | 2:27PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:0xc000857958} sizeCache:0 unknownFields:[] Model:qwen3-30b-a3b-fast.gguf ContextSize:8192 Seed:541601836 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:15 RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/qwen3-30b-a3b-fast.gguf PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 LoadFormat: DisableLogStatus:false DType: LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:auto NoKVOffload:false ModelPath:/models LoraAdapters:[] LoraScales:[] Options:[gpu] CacheTypeKey: CacheTypeValue: GrammarTriggers:[] Reranking:false Overrides:[qwen3moe.expert_used_count=int:8]}
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr build: 6673 (d64c8104) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr system info: n_threads = 15, n_threads_batch = -1, total_threads = 32
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr ggml_cuda_init: found 2 CUDA devices:
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /LocalAI/backend/cpp/llama-cpp-avx2-build/llama.cpp/common/common.cpp:1140: GGML_ASSERT(params.kv_overrides.back().key[0] == 0 && "KV overrides not terminated with empty key") failed
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr system_info: n_threads = 15 / 32 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr srv load_model: loading model '/models/qwen3-30b-a3b-fast.gguf'
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0xe4518b)[0x756ac0c4518b]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0xe4574f)[0x756ac0c4574f]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0xe4591e)[0x756ac0c4591e]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x287132)[0x756ac0087132]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x28c16c)[0x756ac008c16c]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x1964db)[0x756abff964db]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x1e6310)[0x756abffe6310]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x48165e)[0x756ac028165e]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x49fc28)[0x756ac029fc28]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x4a7049)[0x756ac02a7049]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x4e3d36)[0x756ac02e3d36]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x4e3ae3)[0x756ac02e3ae3]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x4e4521)[0x756ac02e4521]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x502da7)[0x756ac0302da7]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x5021b3)[0x756ac03021b3]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x501fee)[0x756ac0301fee]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x502012)[0x756ac0302012]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x17e65ee)[0x756ac15e65ee]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/llama-cpp-avx2(+0x17e663b)[0x756ac15e663b]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/lib/libc.so.6(+0x94ac3)[0x756ab4494ac3]
localai-1 | 2:27PM DBG GRPC(qwen3-test-127.0.0.1:40095): stderr /backends/cuda12-llama-cpp/lib/libc.so.6(clone+0x44)[0x756ab4525a74]
localai-1 | 2:27PM ERR Failed to load model qwen3-test with backend localai@llama-cpp error="failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF" modelID=qwen3-test
localai-1 | 2:27PM INF [localai@llama-cpp] Fails: failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
localai-1 | 2:27PM ERR Server error error="could not load model - all backends returned error: [llama-cpp]: failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF\n[rerankers]: failed to load model with internal loader: could not load model (no success): Unexpected err=OSError(\"qwen3-30b-a3b-fast.gguf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\\nIf this is a private repository, make sure to pass a token having permission to this repo either by logging in with `hf auth login` or by passing `token=<your_token>`\"), type(err)=<class 'OSError'>\n[cuda12-llama-cpp]: failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF\n[cuda12-rerankers]: failed to load model with internal loader: could not load model (no success): Unexpected err=OSError(\"qwen3-30b-a3b-fast.gguf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\\nIf this is a private repository, make sure to pass a token having permission to this repo either by logging in with `hf auth login` or by passing `token=<your_token>`\"), type(err)=<class 'OSError'>\n[localai@llama-cpp]: failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF" ip=10.40.1.3 latency=25.018143589s method=POST status=500 url=/v1/chat/completions
Relevant error: common/common.cpp:1140: GGML_ASSERT(params.kv_overrides.back().key[0] == 0 && "KV overrides not terminated with empty key") failed