Bug Description
`LlamaRankingContext.rank()` returns `NaN` and `rankAndSort()` returns `null` scores for all documents when using the Qwen3 Reranker model (`ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF`).
Root Cause
The Qwen3 Reranker uses a thinking-based reranking template that expects the model to generate tokens (think + yes/no answer), not produce a single embedding logit:
```
<|im_start|>system
Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>
<|im_start|>user
<Instruct>: Given a web search query, retrieve relevant passages that answer the query
<Query>: {query}
<Document>: {document}<|im_end|>
<|im_start|>assistant
<think>
</think>
```
However, `_evaluateRankingForInput()` in `LlamaRankingContext` decodes the input in a single pass and reads one embedding logit, without generating any tokens:
```ts
const embedding = this._llamaContext._ctx.getEmbedding(input.length, 1);
const logit = embedding[0];
const probability = logitToSigmoid(logit);
```
This works for traditional cross-encoder rerankers (like `jina-reranker-v2`) that produce a relevance logit directly, but Qwen3's architecture requires generating the `<think>...</think>` reasoning followed by a "yes"/"no" token, using the logprobs of "yes" vs. "no" as the relevance score.
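The difference between the two scoring schemes can be expressed as pure math: the current code applies a sigmoid to one logit, while Qwen3-style scoring is a two-way softmax over the "yes" and "no" logits at the answer position. A minimal sketch (the `qwen3Relevance` name is illustrative, not a node-llama-cpp API):

```typescript
// Current behavior: treat a single embedding logit as the score.
// Valid for cross-encoder rerankers; meaningless for Qwen3.
function logitToSigmoid(logit: number): number {
    return 1 / (1 + Math.exp(-logit));
}

// Intended for Qwen3: P("yes") via a softmax over the "yes"/"no" logits
// taken at the first token generated after the closing </think> tag.
function qwen3Relevance(yesLogit: number, noLogit: number): number {
    const maxLogit = Math.max(yesLogit, noLogit); // subtract for numerical stability
    const yes = Math.exp(yesLogit - maxLogit);
    const no = Math.exp(noLogit - maxLogit);
    return yes / (yes + no);
}

console.log(logitToSigmoid(0));     // 0.5
console.log(qwen3Relevance(2, -2)); // ~0.982
```

Note that `qwen3Relevance(a, b)` equals `logitToSigmoid(a - b)`, so the existing sigmoid helper could be reused once the two answer-token logits are available.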
Reproduction
```ts
import { getLlama } from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: "hf:ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF/qwen3-reranker-0.6b-q8_0.gguf"
});
const ctx = await model.createRankingContext({ contextSize: 4096 });

const score = await ctx.rank("What are projects?", "Projects use NNN naming convention");
console.log(score); // NaN

const sorted = await ctx.rankAndSort("test query", ["relevant doc", "irrelevant doc"]);
console.log(sorted); // [{ document: "...", score: null }, { document: "...", score: null }]

await ctx.dispose();
await model.dispose();
await llama.dispose();
```
Expected Behavior
For Qwen3-style rerankers, the implementation should:
- Generate tokens until the model produces "yes" or "no" (after the `<think>` block)
- Use the logprob of the "yes" token as the relevance score (as described in the Qwen3 Reranker documentation)
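Since the Qwen3 reranking template is fixed, assembling the exact prompt quoted under "Root Cause" is mechanical. A hypothetical helper (not part of node-llama-cpp; the default instruction string is the one from the template above):

```typescript
// Hypothetical helper: build the Qwen3 Reranker prompt from the template
// quoted in the "Root Cause" section.
const SYSTEM_PROMPT =
    'Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".';
const DEFAULT_INSTRUCT =
    "Given a web search query, retrieve relevant passages that answer the query";

function buildQwen3RerankPrompt(
    query: string,
    document: string,
    instruct: string = DEFAULT_INSTRUCT
): string {
    return [
        "<|im_start|>system",
        `${SYSTEM_PROMPT}<|im_end|>`,
        "<|im_start|>user",
        `<Instruct>: ${instruct}`,
        `<Query>: ${query}`,
        `<Document>: ${document}<|im_end|>`,
        "<|im_start|>assistant",
        "<think>",
        "</think>"
    ].join("\n");
}
```

The implementation would then decode this prompt, generate until the token after `</think>`, and read the "yes"/"no" logprobs at that position.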
Environment
- node-llama-cpp: 3.15.1
- Model: `hf:ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF/qwen3-reranker-0.6b-q8_0.gguf`
- OS: macOS (Apple Silicon)
- Runtime: Bun 1.3.9
Related
- llama.cpp #16407 — similar broken scores via llama-server
- llama.cpp #17743 — empty output from Qwen3 reranker
- This model is the default reranker in QMD, affecting all QMD users' `query` command