Add support for Qwen3-Reranker #15824
Conversation
Great - will take a look tomorrow. Would be useful to add a basic usage example in the OP of this PR.
Yup, will add a usage example up above. Actually encountering some numerical differences comparing the output here to the …
Force-pushed 9d73260 to 22dd428
Ok, finally fixed it! Now we have numerical parity with the HF implementation. It turned out to be a small difference in the chat template. Should be ready for review @ggerganov.
```diff
-bool last = cparams.pooling_type == LLAMA_POOLING_TYPE_LAST;
+const bool last = (
+    cparams.pooling_type == LLAMA_POOLING_TYPE_LAST ||
+    (cparams.pooling_type == LLAMA_POOLING_TYPE_RANK && arch == LLM_ARCH_QWEN3) // qwen3 reranking & embedding models use last token
+);
```
I am wondering if it makes sense to remove pooling type RANK altogether from libllama? Do you have any thoughts about whether having a separate pooling class RANK is really necessary?
I think you could get really close to merging RANK with LAST. The main differentiator is in `llm_graph_context::build_pooling`, where you apply `cls_out` to map from the last token of the last hidden state to the classification output (usually yes/no). Unlike the other pooling types, you actually need knowledge of the model weights to do the calculation.
Backport upstream commit b5bd037 ("llama : add support for qwen3 reranker ggml-org#15824") to b6440, the last version before the Metal async backend bug (b6441+) that crashes embedding/reranker models on Apple Silicon.

Changes:
- Add cls.output tensor to qwen3 arch definition
- Load cls_out classification head in qwen3 model loader
- Support RANK pooling with only cls_out (no cls required)
- Use last-token pooling for qwen3 RANK mode
- Add softmax output for qwen3 reranker
- Use rerank chat template when available (skip SEP token requirement)
- Add one-click deployment scripts for Embedding and Reranker models

Tested on Apple M1 Pro with:
- Qwen3-Embedding-0.6B-Q8_0 (embedding, port 8080)
- Qwen3-Reranker-4B-Q4_K_M (reranker, port 8082)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add support for Qwen3 reranking models. This is largely based on #14029 by @ngxson, with a few tweaks to reflect changes to the codebase in the interim.
This hardcodes the chat template provided in the README.md, which I'm assuming is the intended usage. If folks want to be able to change that, then we'd need a new CLI option. The template uses string substitution rather than jinja, as it seems like jinja is only used for chat messages.

Edit: Here's an example usage similar to that used in the official Qwen repo. Note that `\t` separates queries from documents and `\n` separates different prompts.

```sh
build/bin/llama-embedding -m qwen3-reranker-0.6b-f32.gguf --embd-normalize -1 -p "What is the capital of China?\tThe capital of China is Beijing.\nExplain gravity\tGravity is a force that attracts two bodies towards each other."
```

Notice that we need to pass `--embd-normalize -1` to disable normalization (the default is L2 norm).