Name and Version
build: 7435 (79dbae0) with GNU 14.2.0 for Linux x86_64
Operating systems
Linux
GGML backends
CPU
Hardware
- CPU: Xeon Gold 6428N with AMX support (AVX512, AMX_INT8, AMX_TILE)
- RAM: 128GB
- GPU: None (CPU-only)
Models
- unsloth/Nemotron-3-Nano-30B-A3B-GGUF/Nemotron-3-Nano-30B-A3B-UD-Q6_K_XL.gguf
- unsloth/Nemotron-3-Nano-30B-A3B-GGUF/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf
- unsloth/Nemotron-3-Nano-30B-A3B-GGUF/Nemotron-3-Nano-30B-A3B-Q8_0.gguf
Problem description & steps to reproduce
What happened?
Nemotron 3 Nano crashes during context initialization on a CPU-only system.
Error
/opt/llama.cpp/ggml/src/ggml-backend.cpp:1149: GGML_ASSERT(*cur_backend_id != -1) failed
System Info
- OS: Debian 13 6.12.57+deb13-amd64
Build Configuration
cmake -B build -DCMAKE_BUILD_TYPE=Release \
-DGGML_NATIVE=ON \
-DGGML_AMX_TILE=ON \
-DGGML_AMX_INT8=ON \
-DGGML_CUDA=OFF
cmake --build build --config Release -j$(nproc)
Command
./build/bin/llama-server \
-m /opt/llama.cpp/models/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
-c 8192 -t 16 --host 0.0.0.0 --port 8082
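The server log's memory-fitting step prints "to report bugs during this step use -fit off (or --verbose if you can't)", so a variant of the command with that flag may help isolate whether the crash comes from the fitting pass. The flag is quoted from the log output; whether it avoids the assert here is untested:

```shell
# Re-run with memory fitting disabled, as suggested by the server log
# ("to report bugs during this step use -fit off"):
./build/bin/llama-server \
  -m /opt/llama.cpp/models/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
  -c 8192 -t 16 --host 0.0.0.0 --port 8082 -fit off
```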
Expected Behavior
According to PR #18058, the model should work on CPU. The model loads successfully but crashes during graph scheduling.
Notes
- Other models work fine (e.g. GPT-OSS 20B GGUF)
- Crash occurs in ggml_backend_sched_split_graph - the backend scheduler cannot assign operations to a backend
- Does the hybrid Mamba-Transformer MoE architecture require a GPU for certain ops?
First Bad Commit
No response
Relevant log output
Dec 16 13:32:45 yolops-102 systemd[1]: Started llama-server.service - Llama.cpp Server.
Dec 16 13:32:45 yolops-102 llama-server[400962]: main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
Dec 16 13:32:45 yolops-102 llama-server[400962]: build: 7435 (79dbae034) with GNU 14.2.0 for Linux x86_64
Dec 16 13:32:45 yolops-102 llama-server[400962]: system info: n_threads = 16, n_threads_batch = 16, total_threads = 64
Dec 16 13:32:45 yolops-102 llama-server[400962]: system_info: n_threads = 16 (n_threads_batch = 16) / 64 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Dec 16 13:32:45 yolops-102 llama-server[400962]: init: using 63 threads for HTTP server
Dec 16 13:32:45 yolops-102 llama-server[400962]: start: binding port with default address family
Dec 16 13:32:45 yolops-102 llama-server[400962]: main: loading model
Dec 16 13:32:45 yolops-102 llama-server[400962]: srv load_model: loading model '/opt/llama.cpp/models/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf'
Dec 16 13:32:45 yolops-102 llama-server[400962]: common_init_result: fitting params to device memory, to report bugs during this step use -fit off (or --verbose if you can't)
Dec 16 13:32:46 yolops-102 llama-server[400962]: /opt/llama.cpp/ggml/src/ggml-backend.cpp:1149: GGML_ASSERT(*cur_backend_id != -1) failed
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libggml-base.so.0(+0x149a5) [0x7fcf2f4b39a5]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x1df) [0x7fcf2f4b3d6f]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libggml-base.so.0(ggml_abort+0x11e) [0x7fcf2f4b3efe]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libggml-base.so.0(ggml_backend_sched_split_graph+0x21f4) [0x7fcf2f4cd9e4]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context13graph_reserveEjjjPK22llama_memory_context_ibPm+0x46c) [0x7fcf2f29f7cc]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libllama.so.0(_ZN13llama_contextC2ERK11llama_model20llama_context_params+0x1aa8) [0x7fcf2f2a2de8]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libllama.so.0(llama_init_from_model+0x106) [0x7fcf2f2a3426]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libllama.so.0(+0x7d7f5) [0x7fcf2f27d7f5]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libllama.so.0(+0x7ea2b) [0x7fcf2f27ea2b]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libllama.so.0(llama_params_fit+0x4e) [0x7fcf2f2820fe]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/llama-server(+0x263f0c) [0x5590d6208f0c]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/llama-server(+0x2666d9) [0x5590d620b6d9]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/llama-server(+0x154463) [0x5590d60f9463]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/llama-server(+0x8c20b) [0x5590d603120b]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7fcf2ec33ca8]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7fcf2ec33d65]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/llama-server(+0x8eb21) [0x5590d6033b21]
Dec 16 13:32:46 yolops-102 systemd[1]: llama-server.service: Main process exited, code=killed, status=6/ABRT
Dec 16 13:32:46 yolops-102 systemd[1]: llama-server.service: Failed with result 'signal'.