Qwen3 Reranker: rankAndSort() returns NaN/null scores #550

@jesper-bylund

Bug Description

LlamaRankingContext.rank() returns NaN and rankAndSort() returns null scores for all documents when using the Qwen3 Reranker model (ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF).

Root Cause

The Qwen3 Reranker uses a thinking-based reranking template that expects the model to generate tokens (think + yes/no answer), not produce a single embedding logit:

<|im_start|>system
Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>
<|im_start|>user
<Instruct>: Given a web search query, retrieve relevant passages that answer the query
<Query>: {query}
<Document>: {document}<|im_end|>
<|im_start|>assistant
<think>

</think>
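
As a rough illustration, the template above could be filled in with a small helper like the one below. The function name, the default instruction constant, and the trailing blank line after `</think>` are assumptions made for this sketch; none of them are part of node-llama-cpp.

```typescript
// Hypothetical helper (not a node-llama-cpp API): fills in the template quoted above.
const DEFAULT_INSTRUCT =
    "Given a web search query, retrieve relevant passages that answer the query";

function buildQwen3RerankPrompt(
    query: string,
    document: string,
    instruct: string = DEFAULT_INSTRUCT
): string {
    return [
        "<|im_start|>system",
        'Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>',
        "<|im_start|>user",
        `<Instruct>: ${instruct}`,
        `<Query>: ${query}`,
        `<Document>: ${document}<|im_end|>`,
        "<|im_start|>assistant",
        "<think>",
        "",
        "</think>",
        "",
        "" // assumption: the prompt ends with a blank line after the empty think block
    ].join("\n");
}

const examplePrompt = buildQwen3RerankPrompt(
    "What are projects?",
    "Projects use NNN naming convention"
);
```

The empty `<think>` block is already part of the prompt, so the next token the model emits after it is expected to be the "yes" or "no" answer.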

However, _evaluateRankingForInput() in LlamaRankingContext evaluates the input once and reads the first value of the embedding output as the relevance logit, without generating any tokens:

const embedding = this._llamaContext._ctx.getEmbedding(input.length, 1);
const logit = embedding[0];
const probability = logitToSigmoid(logit);
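
One plausible way a NaN can come out of this path is sketched below. It assumes that logitToSigmoid is the standard logistic function and that the embedding buffer is empty or holds a non-finite value for this model. Neither assumption is confirmed in this issue, and `logitToSigmoidSketch` is a stand-in, not the library's code.

```typescript
// Assumed stand-in for the library's logitToSigmoid; the real implementation is not shown in this issue.
function logitToSigmoidSketch(logit: number): number {
    return 1 / (1 + Math.exp(-logit));
}

// An empty buffer yields `undefined` at index 0 at runtime, even though the type says `number`.
const emptyEmbedding = new Float64Array(0);
const missingLogit = emptyEmbedding[0] as number;

const fromMissing = logitToSigmoidSketch(missingLogit); // the missing value turns into NaN
const fromNaN = logitToSigmoidSketch(NaN);              // a NaN logit stays NaN
const fromValid = logitToSigmoidSketch(0);              // a finite logit gives a finite probability

console.log(Number.isNaN(fromMissing), Number.isNaN(fromNaN), fromValid);
```

Either way, the sigmoid passes a bad logit straight through, so the NaN seen by the caller would originate before this step.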

This works for traditional cross-encoder rerankers (like jina-reranker-v2) that produce a relevance logit directly. Qwen3's reranker instead requires generating the <think>...</think> reasoning and then a "yes"/"no" token, with the logprobs of "yes" vs. "no" serving as the relevance score.

Reproduction

import { getLlama } from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
  modelPath: "hf:ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF/qwen3-reranker-0.6b-q8_0.gguf"
});

const ctx = await model.createRankingContext({ contextSize: 4096 });

const score = await ctx.rank("What are projects?", "Projects use NNN naming convention");
console.log(score); // NaN

const sorted = await ctx.rankAndSort("test query", ["relevant doc", "irrelevant doc"]);
console.log(sorted); // [{ document: "...", score: null }, { document: "...", score: null }]

await ctx.dispose();
await model.dispose();
await llama.dispose();
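
Until this is fixed, callers may want to detect the failure rather than silently sort on unusable scores. The guard below is a minimal sketch; `RankedDocument` and `findInvalidScores` are hypothetical names that mirror the result shape shown in the reproduction, not node-llama-cpp exports.

```typescript
// Hypothetical guard (not a node-llama-cpp API): flags results whose score is null, NaN, or infinite.
type RankedDocument = { document: string; score: number | null };

function findInvalidScores(results: RankedDocument[]): RankedDocument[] {
    return results.filter(({ score }) => score === null || !Number.isFinite(score));
}

// Sample data shaped like the output in the reproduction above.
const sampleResults: RankedDocument[] = [
    { document: "relevant doc", score: null },
    { document: "irrelevant doc", score: NaN },
    { document: "scored doc", score: 0.42 }
];

const invalid = findInvalidScores(sampleResults);
if (invalid.length > 0) {
    console.warn(`Reranker returned ${invalid.length} unusable score(s)`);
}
```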

Expected Behavior

For Qwen3-style rerankers, the implementation should:

  1. Generate tokens until the model produces "yes" or "no" (after the <think> block)
  2. Use the logprob of the "yes" token, compared against "no", as the relevance score (as described in the Qwen3 Reranker documentation)
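
The scoring step could look roughly like the sketch below. It assumes the score is the probability of "yes" after renormalizing over only the "yes" and "no" tokens. `yesNoRelevance` is a hypothetical helper, and how the two logits would be obtained from node-llama-cpp is left open.

```typescript
// Hypothetical helper: turns the next-token logits for "yes" and "no" into a score in [0, 1].
function yesNoRelevance(yesLogit: number, noLogit: number): number {
    // A two-way softmax, written as a sigmoid of the logit difference:
    // exp(yes) / (exp(yes) + exp(no)) === 1 / (1 + exp(no - yes))
    return 1 / (1 + Math.exp(noLogit - yesLogit));
}

// Example logits (made up for illustration, not taken from the model).
const confidentYes = yesNoRelevance(6.0, -2.0);
const confidentNo = yesNoRelevance(-2.0, 6.0);
const undecided = yesNoRelevance(1.5, 1.5);

console.log({ confidentYes, confidentNo, undecided });
```

Using the difference of the two logits keeps the result between 0 and 1 for any finite inputs, so it can be compared with the sigmoid scores of traditional cross-encoder rerankers.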

Environment

  • node-llama-cpp: 3.15.1
  • Model: hf:ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF/qwen3-reranker-0.6b-q8_0.gguf
  • OS: macOS (Apple Silicon)
  • Runtime: Bun 1.3.9

Related

  • llama.cpp #16407 — similar broken scores via llama-server
  • llama.cpp #17743 — empty output from Qwen3 reranker
  • This model is the default reranker in QMD, so the bug affects the query command for all QMD users
