
Eval bug: Nemotron 3 Nano crashes on CPU-only with GGML_ASSERT(*cur_backend_id != -1) failed #18099

@great1cornholio

Name and Version

build: 7435 (79dbae0) with GNU 14.2.0 for Linux x86_64

Operating systems

Linux

GGML backends

CPU

Hardware

  • CPU: Xeon Gold 6428N with AMX support (AVX512, AMX_INT8, AMX_TILE)
  • RAM: 128GB
  • GPU: None (CPU-only)

Models

  • unsloth/Nemotron-3-Nano-30B-A3B-GGUF/Nemotron-3-Nano-30B-A3B-UD-Q6_K_XL.gguf
  • unsloth/Nemotron-3-Nano-30B-A3B-GGUF/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf
  • unsloth/Nemotron-3-Nano-30B-A3B-GGUF/Nemotron-3-Nano-30B-A3B-Q8_0.gguf

Problem description & steps to reproduce

What happened?

Nemotron 3 Nano crashes during context initialization on a CPU-only system.

Error

/opt/llama.cpp/ggml/src/ggml-backend.cpp:1149: GGML_ASSERT(*cur_backend_id != -1) failed
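For context: this assert fires in the scheduler's graph-splitting pass, which walks every graph node and tries each registered backend in priority order; a backend id of -1 means no backend claimed the node. The toy C++ sketch below is a simplification for illustration only (toy_backend and assign_backend are invented names, not llama.cpp source):

#include <cassert>
#include <cstdio>
#include <set>
#include <string>
#include <vector>

// one toy backend = a name plus the set of ops it claims to support
struct toy_backend {
    std::string name;
    std::set<std::string> supported_ops;
};

// mirrors the scheduler's fallback: the first backend that supports the op
// wins; if none does, the real code aborts via GGML_ASSERT(*cur_backend_id != -1)
static int assign_backend(const std::vector<toy_backend> & backends, const std::string & op) {
    int cur_backend_id = -1;
    for (size_t b = 0; b < backends.size() && cur_backend_id == -1; ++b) {
        if (backends[b].supported_ops.count(op)) {
            cur_backend_id = (int) b;
        }
    }
    assert(cur_backend_id != -1 && "no backend supports this op"); // the failing check
    return cur_backend_id;
}

int main() {
    // CPU-only setup, like the system above
    std::vector<toy_backend> backends = { { "CPU", { "MUL_MAT", "ADD", "SOFT_MAX" } } };
    printf("MUL_MAT -> backend %d\n", assign_backend(backends, "MUL_MAT"));
    assign_backend(backends, "SOME_UNSUPPORTED_OP"); // aborts here, analogous to the crash
}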

System Info

  • OS: Debian 13 6.12.57+deb13-amd64

Build Configuration

cmake -B build -DCMAKE_BUILD_TYPE=Release \
  -DGGML_NATIVE=ON \
  -DGGML_AMX_TILE=ON \
  -DGGML_AMX_INT8=ON \
  -DGGML_CUDA=OFF
cmake --build build --config Release -j$(nproc)

Command

./build/bin/llama-server \
  -m /opt/llama.cpp/models/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
  -c 8192 -t 16 --host 0.0.0.0 --port 8082

Expected Behavior

According to PR #18058, the model should work on CPU. The model loads successfully but crashes during graph scheduling.
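For what it's worth, a minimal standalone reproduction of the same code path, sketched against the public llama.h API (model path, context size, and thread count are just the values from the command above), could look like this:

#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    const char * path = argc > 1 ? argv[1] : "Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf";

    llama_backend_init();

    // default model params on a machine with no GPU -> CPU-only
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file(path, mparams);
    if (model == NULL) { fprintf(stderr, "model load failed\n"); return 1; }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx     = 8192;
    cparams.n_threads = 16;

    // per the backtrace, the abort fires inside this call (graph_reserve ->
    // ggml_backend_sched_split_graph), before any tokens are decoded
    llama_context * ctx = llama_init_from_model(model, cparams);
    if (ctx == NULL) { fprintf(stderr, "context init failed\n"); return 1; }

    printf("context created OK\n");
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}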

Notes

  • Other models work fine (e.g. GPT-OSS 20B GGUF)
  • Crash occurs in ggml_backend_sched_split_graph: the backend scheduler cannot assign an op to any backend
  • Does the hybrid Mamba-Transformer MoE architecture require a GPU for certain ops? (a way to probe this directly is sketched after this list)
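To probe that last question directly, ggml exposes ggml_backend_supports_op, the same capability check the scheduler relies on. A sketch, assuming the ggml headers from this build (MUL_MAT is just a stand-in; any node built through the ggml API can be probed the same way):

#include "ggml.h"
#include "ggml-backend.h"
#include "ggml-cpu.h"
#include <cstdio>

int main() {
    // metadata-only context: no_alloc = true, since supports_op never reads tensor data
    struct ggml_init_params ip = { /*mem_size*/ 16u*1024*1024, /*mem_buffer*/ NULL, /*no_alloc*/ true };
    struct ggml_context * ctx = ggml_init(ip);
    ggml_backend_t cpu = ggml_backend_cpu_init();

    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor * node = ggml_mul_mat(ctx, a, b);

    printf("%s supported on CPU backend: %s\n", ggml_op_name(node->op),
           ggml_backend_supports_op(cpu, node) ? "yes" : "no");

    ggml_backend_free(cpu);
    ggml_free(ctx);
    return 0;
}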

First Bad Commit

No response

Relevant log output

Dec 16 13:32:45 yolops-102 systemd[1]: Started llama-server.service - Llama.cpp Server.
Dec 16 13:32:45 yolops-102 llama-server[400962]: main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
Dec 16 13:32:45 yolops-102 llama-server[400962]: build: 7435 (79dbae034) with GNU 14.2.0 for Linux x86_64
Dec 16 13:32:45 yolops-102 llama-server[400962]: system info: n_threads = 16, n_threads_batch = 16, total_threads = 64
Dec 16 13:32:45 yolops-102 llama-server[400962]: system_info: n_threads = 16 (n_threads_batch = 16) / 64 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Dec 16 13:32:45 yolops-102 llama-server[400962]: init: using 63 threads for HTTP server
Dec 16 13:32:45 yolops-102 llama-server[400962]: start: binding port with default address family
Dec 16 13:32:45 yolops-102 llama-server[400962]: main: loading model
Dec 16 13:32:45 yolops-102 llama-server[400962]: srv    load_model: loading model '/opt/llama.cpp/models/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf'
Dec 16 13:32:45 yolops-102 llama-server[400962]: common_init_result: fitting params to device memory, to report bugs during this step use -fit off (or --verbose if you can't)
Dec 16 13:32:46 yolops-102 llama-server[400962]: /opt/llama.cpp/ggml/src/ggml-backend.cpp:1149: GGML_ASSERT(*cur_backend_id != -1) failed
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libggml-base.so.0(+0x149a5) [0x7fcf2f4b39a5]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x1df) [0x7fcf2f4b3d6f]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libggml-base.so.0(ggml_abort+0x11e) [0x7fcf2f4b3efe]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libggml-base.so.0(ggml_backend_sched_split_graph+0x21f4) [0x7fcf2f4cd9e4]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context13graph_reserveEjjjPK22llama_memory_context_ibPm+0x46c) [0x7fcf2f29f7cc]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libllama.so.0(_ZN13llama_contextC2ERK11llama_model20llama_context_params+0x1aa8) [0x7fcf2f2a2de8]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libllama.so.0(llama_init_from_model+0x106) [0x7fcf2f2a3426]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libllama.so.0(+0x7d7f5) [0x7fcf2f27d7f5]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libllama.so.0(+0x7ea2b) [0x7fcf2f27ea2b]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/libllama.so.0(llama_params_fit+0x4e) [0x7fcf2f2820fe]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/llama-server(+0x263f0c) [0x5590d6208f0c]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/llama-server(+0x2666d9) [0x5590d620b6d9]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/llama-server(+0x154463) [0x5590d60f9463]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/llama-server(+0x8c20b) [0x5590d603120b]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7fcf2ec33ca8]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7fcf2ec33d65]
Dec 16 13:32:46 yolops-102 llama-server[401029]: /opt/llama.cpp/build/bin/llama-server(+0x8eb21) [0x5590d6033b21]
Dec 16 13:32:46 yolops-102 systemd[1]: llama-server.service: Main process exited, code=killed, status=6/ABRT
Dec 16 13:32:46 yolops-102 systemd[1]: llama-server.service: Failed with result 'signal'.
