Problem
The candle_backend.rs GGUF loader correctly reads per-architecture metadata keys for Qwen2, Phi, and Gemma (embedding_length, block_count, head_count, etc.), but the actual weight loading on line 693 always uses qlama::ModelWeights::from_gguf(), which only works for Llama-architecture models.
This means the README's "Supported Models" table lists Qwen 2.5, Phi-3, and Gemma-2 as supported, but they all fail with:
```
Model config: hidden=2048, layers=36, heads=16, kv_heads=2, vocab=32000
Model load failed: Model error: Failed to load GGUF weights: cannot find llama.attention.head_count in metadata
```
(The metadata error is because the GGUF file uses qwen2.attention.head_count instead of llama.attention.head_count.)
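The prefix mismatch can be avoided by deriving the key from general.architecture instead of hardcoding "llama.". A minimal sketch, assuming a plain string map standing in for the real GGUF metadata table (head_count_key is a hypothetical helper, not an existing ruvllm function):

```rust
use std::collections::HashMap;

// Hypothetical helper: build the attention.head_count key using the
// architecture prefix from general.architecture, falling back to "llama".
fn head_count_key(metadata: &HashMap<String, String>) -> String {
    let arch = metadata
        .get("general.architecture")
        .map(String::as_str)
        .unwrap_or("llama");
    format!("{arch}.attention.head_count")
}

fn main() {
    let mut md = HashMap::new();
    md.insert("general.architecture".to_string(), "qwen2".to_string());
    md.insert("qwen2.attention.head_count".to_string(), "16".to_string());
    let key = head_count_key(&md);
    assert_eq!(key, "qwen2.attention.head_count");
    println!("{key} = {}", md[&key]);
}
```

The same prefixing applies to the other qwen2.* / phi.* / gemma.* keys the loader already parses.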
Root Cause
crates/ruvllm/src/backends/candle_backend.rs:693:
```rust
let model_weights =
    qlama::ModelWeights::from_gguf(gguf_content, &mut file, &self.device)
```
This hardcodes the Llama weight loader regardless of the detected architecture. Qwen2 uses different tensor naming conventions (model.layers.N.self_attn.q_proj vs blk.N.attn_q).
Expected Behavior
The loader should detect the GGUF architecture metadata (general.architecture) and dispatch to the appropriate weight loader:
- llama → qlama::ModelWeights
- qwen2 → qwen2-specific weight mapper
- phi → phi-specific weight mapper
- gemma → gemma-specific weight mapper
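The dispatch step above could look something like this sketch (the WeightLoader variants are placeholders for whatever per-architecture loaders get implemented, not existing ruvllm/candle types):

```rust
// Hypothetical dispatch on the general.architecture metadata string.
#[derive(Debug, PartialEq)]
enum WeightLoader {
    Llama, // would wrap qlama::ModelWeights::from_gguf
    Qwen2,
    Phi,
    Gemma,
}

fn select_loader(arch: &str) -> Result<WeightLoader, String> {
    match arch {
        "llama" => Ok(WeightLoader::Llama),
        "qwen2" => Ok(WeightLoader::Qwen2),
        "phi" | "phi3" => Ok(WeightLoader::Phi),
        "gemma" | "gemma2" => Ok(WeightLoader::Gemma),
        // Unknown architectures should error instead of silently
        // falling through to the Llama loader.
        other => Err(format!("unsupported GGUF architecture: {other}")),
    }
}

fn main() {
    assert_eq!(select_loader("qwen2"), Ok(WeightLoader::Qwen2));
    assert!(select_loader("mamba").is_err());
    println!("dispatch ok");
}
```

An explicit error arm also gives new architectures an obvious place to plug in.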
Environment
- ruvllm v2.1.0 (built from main @ 50b4bb5)
- macOS aarch64 (Apple M1, 16GB)
- Tested with:
  - Qwen/Qwen2.5-3B-Instruct-GGUF (qwen2.5-3b-instruct-q4_k_m.gguf, 2.1GB)
Partial Fix (metadata keys only)
I've added qwen2.* and gemma.* metadata keys to the config extraction (lines 560-662), but weight loading still fails because qlama::ModelWeights::from_gguf expects Llama tensor names.
Impact
Users who try to run Qwen, Phi, or Gemma models via ruvllm serve/benchmark/chat silently fall back to mock mode with fake inference results. This is misleading: the benchmark reports ~500K tok/s, which is clearly mock data.
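Until per-architecture loaders exist, the fallback could at least fail loudly. A sketch of that idea, where load_real_backend is a hypothetical stand-in for the real GGUF load path (not a ruvllm API):

```rust
// Hypothetical stand-in for the real GGUF load path; always fails here
// to demonstrate the error-handling shape.
fn load_real_backend(path: &str) -> Result<(), String> {
    Err(format!("Failed to load GGUF weights from {path}"))
}

fn main() {
    match load_real_backend("qwen2.5-3b-instruct-q4_k_m.gguf") {
        Ok(()) => println!("loaded real backend"),
        Err(e) => {
            // Surface the failure so benchmarks never report mock
            // throughput numbers as if they were real inference.
            eprintln!("Model load failed: {e}; refusing to fall back to mock");
        }
    }
}
```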