Bug Description
`LlamaRankingContext.rank()` returns `NaN` and `rankAndSort()` returns `null` scores for all documents when using the Qwen3 Reranker model (`ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF`).
Root Cause
The Qwen3 Reranker uses a thinking-based reranking template that expects the model to generate tokens (think + yes/no answer), not produce a single embedding logit:
```
<|im_start|>system
Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>
<|im_start|>user
<Instruct>: Given a web search query, retrieve relevant passages that answer the query
<Query>: {query}
<Document>: {document}<|im_end|>
<|im_start|>assistant
<think>
</think>
```
However, `_evaluateRankingForInput()` in `LlamaRankingContext` decodes the input in a single pass and reads one embedding logit, without generating any tokens:
```ts
const embedding = this._llamaContext._ctx.getEmbedding(input.length, 1);
const logit = embedding[0];
const probability = logitToSigmoid(logit);
```
This works for traditional cross-encoder rerankers (like `jina-reranker-v2`) that produce a relevance logit directly, but Qwen3's architecture requires generating the `<think>...</think>` reasoning followed by a "yes"/"no" token, using the logprobs of "yes" vs. "no" as the relevance score.
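The difference between the two scoring schemes can be expressed as pure math: the current code applies a sigmoid to one logit, while Qwen3-style scoring is a two-way softmax over the "yes" and "no" logits at the answer position. A minimal sketch (the `qwen3Relevance` name is illustrative, not a node-llama-cpp API):

```typescript
// Current behavior: treat a single embedding logit as the score.
// Valid for cross-encoder rerankers; meaningless for Qwen3.
function logitToSigmoid(logit: number): number {
    return 1 / (1 + Math.exp(-logit));
}

// Intended for Qwen3: P("yes") via a softmax over the "yes"/"no" logits
// taken at the first token generated after the closing </think> tag.
function qwen3Relevance(yesLogit: number, noLogit: number): number {
    const maxLogit = Math.max(yesLogit, noLogit); // subtract for numerical stability
    const yes = Math.exp(yesLogit - maxLogit);
    const no = Math.exp(noLogit - maxLogit);
    return yes / (yes + no);
}

console.log(logitToSigmoid(0));     // 0.5
console.log(qwen3Relevance(2, -2)); // ~0.982
```

Note that `qwen3Relevance(a, b)` equals `logitToSigmoid(a - b)`, so the existing sigmoid helper could be reused once the two answer-token logits are available.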
Reproduction
```ts
import { getLlama } from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: "hf:ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF/qwen3-reranker-0.6b-q8_0.gguf"
});
const ctx = await model.createRankingContext({ contextSize: 4096 });

const score = await ctx.rank("What are projects?", "Projects use NNN naming convention");
console.log(score); // NaN

const sorted = await ctx.rankAndSort("test query", ["relevant doc", "irrelevant doc"]);
console.log(sorted); // [{ document: "...", score: null }, { document: "...", score: null }]

await ctx.dispose();
await model.dispose();
await llama.dispose();
```
Expected Behavior
For Qwen3-style rerankers, the implementation should:
- Generate tokens until the model produces "yes" or "no" (after the `<think>` block)
- Use the logprob of the "yes" token as the relevance score (as described in the Qwen3 Reranker documentation)
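Since the Qwen3 reranking template is fixed, assembling the exact prompt quoted under "Root Cause" is mechanical. A hypothetical helper (not part of node-llama-cpp; the default instruction string is the one from the template above):

```typescript
// Hypothetical helper: build the Qwen3 Reranker prompt from the template
// quoted in the "Root Cause" section.
const SYSTEM_PROMPT =
    'Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".';
const DEFAULT_INSTRUCT =
    "Given a web search query, retrieve relevant passages that answer the query";

function buildQwen3RerankPrompt(
    query: string,
    document: string,
    instruct: string = DEFAULT_INSTRUCT
): string {
    return [
        "<|im_start|>system",
        `${SYSTEM_PROMPT}<|im_end|>`,
        "<|im_start|>user",
        `<Instruct>: ${instruct}`,
        `<Query>: ${query}`,
        `<Document>: ${document}<|im_end|>`,
        "<|im_start|>assistant",
        "<think>",
        "</think>"
    ].join("\n");
}
```

The implementation would then decode this prompt, generate until the token after `</think>`, and read the "yes"/"no" logprobs at that position.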
Environment
- node-llama-cpp: 3.15.1
- Model: `hf:ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF/qwen3-reranker-0.6b-q8_0.gguf`
- OS: macOS (Apple Silicon)
- Runtime: Bun 1.3.9
Related
- llama.cpp #16407 — similar broken scores via llama-server
- llama.cpp #17743 — empty output from Qwen3 reranker
- This model is the default reranker in QMD, affecting all QMD users' `query` command