Problem
The candle_backend.rs GGUF loader correctly reads per-architecture metadata keys for Qwen2, Phi, and Gemma (embedding_length, block_count, head_count, etc.), but the actual weight loading on line 693 always uses qlama::ModelWeights::from_gguf(), which only works for Llama-architecture models.
This means the README's "Supported Models" table lists Qwen 2.5, Phi-3, and Gemma-2 as supported, but they all fail with:
```
Model config: hidden=2048, layers=36, heads=16, kv_heads=2, vocab=32000
Model load failed: Model error: Failed to load GGUF weights: cannot find llama.attention.head_count in metadata
```
(The metadata error is because the GGUF file uses qwen2.attention.head_count instead of llama.attention.head_count.)
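The prefix mismatch can be avoided by deriving the key from general.architecture instead of hardcoding "llama.". A minimal sketch, assuming a plain string map standing in for the real GGUF metadata table (head_count_key is a hypothetical helper, not an existing ruvllm function):

```rust
use std::collections::HashMap;

// Hypothetical helper: build the attention.head_count key using the
// architecture prefix from general.architecture, falling back to "llama".
fn head_count_key(metadata: &HashMap<String, String>) -> String {
    let arch = metadata
        .get("general.architecture")
        .map(String::as_str)
        .unwrap_or("llama");
    format!("{arch}.attention.head_count")
}

fn main() {
    let mut md = HashMap::new();
    md.insert("general.architecture".to_string(), "qwen2".to_string());
    md.insert("qwen2.attention.head_count".to_string(), "16".to_string());
    let key = head_count_key(&md);
    assert_eq!(key, "qwen2.attention.head_count");
    println!("{key} = {}", md[&key]);
}
```

The same prefixing applies to the other qwen2.* / phi.* / gemma.* keys the loader already parses.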
Root Cause
crates/ruvllm/src/backends/candle_backend.rs:693:
```rust
let model_weights =
    qlama::ModelWeights::from_gguf(gguf_content, &mut file, &self.device)
```
This hardcodes the Llama weight loader regardless of the detected architecture. Qwen2 uses different tensor naming conventions (model.layers.N.self_attn.q_proj vs blk.N.attn_q).
Expected Behavior
The loader should detect the GGUF architecture metadata (general.architecture) and dispatch to the appropriate weight loader:
- llama → qlama::ModelWeights
- qwen2 → qwen2-specific weight mapper
- phi → phi-specific weight mapper
- gemma → gemma-specific weight mapper
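The dispatch step above could look something like this sketch (the WeightLoader variants are placeholders for whatever per-architecture loaders get implemented, not existing ruvllm/candle types):

```rust
// Hypothetical dispatch on the general.architecture metadata string.
#[derive(Debug, PartialEq)]
enum WeightLoader {
    Llama, // would wrap qlama::ModelWeights::from_gguf
    Qwen2,
    Phi,
    Gemma,
}

fn select_loader(arch: &str) -> Result<WeightLoader, String> {
    match arch {
        "llama" => Ok(WeightLoader::Llama),
        "qwen2" => Ok(WeightLoader::Qwen2),
        "phi" | "phi3" => Ok(WeightLoader::Phi),
        "gemma" | "gemma2" => Ok(WeightLoader::Gemma),
        // Unknown architectures should error instead of silently
        // falling through to the Llama loader.
        other => Err(format!("unsupported GGUF architecture: {other}")),
    }
}

fn main() {
    assert_eq!(select_loader("qwen2"), Ok(WeightLoader::Qwen2));
    assert!(select_loader("mamba").is_err());
    println!("dispatch ok");
}
```

An explicit error arm also gives new architectures an obvious place to plug in.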
Environment
- ruvllm v2.1.0 (built from main @ 50b4bb5)
- macOS aarch64 (Apple M1, 16GB)
- Tested with:
  - Qwen/Qwen2.5-3B-Instruct-GGUF (qwen2.5-3b-instruct-q4_k_m.gguf, 2.1GB)
Partial Fix (metadata keys only)
I've added qwen2.* and gemma.* metadata keys to the config extraction (lines 560-662), but weight loading still fails because qlama::ModelWeights::from_gguf expects Llama tensor names.
Impact
Users who try to run Qwen, Phi, or Gemma models via ruvllm serve/benchmark/chat silently fall back to mock mode with fake inference results. This is misleading: the benchmark reports ~500K tok/s, which is clearly mock data.
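Until per-architecture loaders exist, the fallback could at least fail loudly. A sketch of that idea, where load_real_backend is a hypothetical stand-in for the real GGUF load path (not a ruvllm API):

```rust
// Hypothetical stand-in for the real GGUF load path; always fails here
// to demonstrate the error-handling shape.
fn load_real_backend(path: &str) -> Result<(), String> {
    Err(format!("Failed to load GGUF weights from {path}"))
}

fn main() {
    match load_real_backend("qwen2.5-3b-instruct-q4_k_m.gguf") {
        Ok(()) => println!("loaded real backend"),
        Err(e) => {
            // Surface the failure so benchmarks never report mock
            // throughput numbers as if they were real inference.
            eprintln!("Model load failed: {e}; refusing to fall back to mock");
        }
    }
}
```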