
Conversation


@renovate renovate bot commented Jul 1, 2025

This PR contains the following updates:

Package                 Change
sentence-transformers   ^3.2.1 -> ^5.0.0

Release Notes

UKPLab/sentence-transformers (sentence-transformers)

v5.0.0: - SparseEncoder support; encode_query & encode_document; multi-processing in encode; Router; and more

Compare Source

This release consists of significant updates including the introduction of Sparse Encoder models, new methods encode_query and encode_document, multi-processing support in encode, the Router module for asymmetric models, custom learning rates for parameter groups, composite loss logging, and various small improvements and bug fixes.

Install this version with

### Training + Inference
pip install sentence-transformers[train]==5.0.0

### Inference only, use one of:
pip install sentence-transformers==5.0.0
pip install sentence-transformers[onnx-gpu]==5.0.0
pip install sentence-transformers[onnx]==5.0.0
pip install sentence-transformers[openvino]==5.0.0

[!TIP]
Our Training and Finetuning Sparse Embedding Models with Sentence Transformers v5 blogpost is an excellent place to learn about finetuning sparse embedding models!

[!NOTE]
This release is designed to be fully backwards compatible, meaning that you should be able to upgrade from older versions to v5.x without any issues. If you are running into issues when upgrading, feel free to open an issue. Also see the Migration Guide for changes that we would recommend.

Sparse Encoder models

The Sentence Transformers v5.0 release introduces Sparse Embedding models, also known as Sparse Encoders. These models generate high-dimensional embeddings, often with 30,000+ dimensions, of which typically fewer than 1% are non-zero. This is in contrast to standard dense embedding models, which produce low-dimensional embeddings (e.g., 384, 768, or 1024 dimensions) where all values are non-zero.

Usually, each active dimension (i.e. a dimension with a non-zero value) in a sparse embedding corresponds to a specific token in the model's vocabulary, allowing for interpretability. This means that you can, for example, see exactly which words/tokens are important in an embedding, and inspect exactly which words/tokens cause two texts to be deemed similar.

Let's have a look at naver/splade-v3, a strong sparse embedding model, as an example:

from sentence_transformers import SparseEncoder

### Download from the 🤗 Hub
model = SparseEncoder("naver/splade-v3")

### Run inference
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)

### (3, 30522)
### Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)

### tensor([[   32.4323,     5.8528,     0.0258],
###         [    5.8528,    26.6649,     0.0302],
###         [    0.0258,     0.0302,    24.0839]])
### Let's decode our embeddings to be able to interpret them
decoded = model.decode(embeddings, top_k=10)
for decoded_sentence, sentence in zip(decoded, sentences):
    print(f"Sentence: {sentence}")
    print(f"Decoded: {decoded_sentence}")
    print()
Sentence: The weather is lovely today.
Decoded: [('weather', 2.754288673400879), ('today', 2.610959529876709), ('lovely', 2.431990623474121), ('currently', 1.5520408153533936), ('beautiful', 1.5046082735061646), ('cool', 1.4664798974990845), ('pretty', 0.8986214995384216), ('yesterday', 0.8603134155273438), ('nice', 0.8322536945343018), ('summer', 0.7702118158340454)]

Sentence: It's so sunny outside!
Decoded: [('outside', 2.6939032077789307), ('sunny', 2.535827398300171), ('so', 2.0600898265838623), ('out', 1.5397940874099731), ('weather', 1.1198079586029053), ('very', 0.9873268604278564), ('cool', 0.9406591057777405), ('it', 0.9026399254798889), ('summer', 0.684999406337738), ('sun', 0.6520509123802185)]

Sentence: He drove to the stadium.
Decoded: [('stadium', 2.7872302532196045), ('drove', 1.8208855390548706), ('driving', 1.6665740013122559), ('drive', 1.5565159320831299), ('he', 1.4721972942352295), ('stadiums', 1.449463129043579), ('to', 1.0441515445709229), ('car', 0.7002660632133484), ('visit', 0.5118278861045837), ('football', 0.502326250076294)]

In this example, the embeddings are 30,522-dimensional vectors, where each dimension corresponds to a token in the model's vocabulary. The decode method returned the top 10 tokens with the highest values in the embedding, allowing us to interpret which tokens contribute most to the embedding.

We can even determine the intersection or overlap between embeddings, which is very useful for understanding why two texts are deemed similar or dissimilar:

### Let's also compute the intersection/overlap of the first two embeddings
intersection_embedding = model.intersection(embeddings[0], embeddings[1])
decoded_intersection = model.decode(intersection_embedding)
print(decoded_intersection)
Decoded: [('weather', 3.0842742919921875), ('cool', 1.379457712173462), ('summer', 0.5275946259498596), ('comfort', 0.3239051103591919), ('sally', 0.22571465373039246), ('julian', 0.14787325263023376), ('nature', 0.08582140505313873), ('beauty', 0.0588383711874485), ('mood', 0.018594780936837196), ('nathan', 0.000752730411477387)]

And if we think the embeddings are too big, we can limit the maximum number of active dimensions like so:

from sentence_transformers import SparseEncoder

### Download from the 🤗 Hub
model = SparseEncoder("naver/splade-v3")  # You can also set max_active_dims here instead of encode()

### Run inference
documents = [
    "UV-A light, specifically, is what mainly causes tanning, skin aging, and cataracts, UV-B causes sunburn, skin aging and skin cancer, and UV-C is the strongest, and therefore most effective at killing microorganisms. Again â\x80\x93 single words and multiple bullets.",
    "Answers from Ronald Petersen, M.D. Yes, Alzheimer's disease usually worsens slowly. But its speed of progression varies, depending on a person's genetic makeup, environmental factors, age at diagnosis and other medical conditions. Still, anyone diagnosed with Alzheimer's whose symptoms seem to be progressing quickly â\x80\x94 or who experiences a sudden decline â\x80\x94 should see his or her doctor.",
    "Bell's palsy and Extreme tiredness and Extreme fatigue (2 causes) Bell's palsy and Extreme tiredness and Hepatitis (2 causes) Bell's palsy and Extreme tiredness and Liver pain (2 causes) Bell's palsy and Extreme tiredness and Lymph node swelling in children (2 causes)",
]
embeddings = model.encode_document(documents, max_active_dims=64)
print(embeddings.shape)

### (3, 30522)
### Print the sparsity of the embeddings
sparsity = model.sparsity(embeddings)
print(sparsity)

### {'active_dims': 64.0, 'sparsity_ratio': 0.9979031518249132}

The following comparison shows that limiting max_active_dims has minimal impact on the similarity scores:

from sentence_transformers import SparseEncoder

### Download from the 🤗 Hub
model = SparseEncoder("naver/splade-v3")  # You can also set max_active_dims here instead of encode()

### Run inference
queries = ["what causes aging fast"]
documents = [
    "UV-A light, specifically, is what mainly causes tanning, skin aging, and cataracts, UV-B causes sunburn, skin aging and skin cancer, and UV-C is the strongest, and therefore most effective at killing microorganisms. Again â\x80\x93 single words and multiple bullets.",
    "Answers from Ronald Petersen, M.D. Yes, Alzheimer's disease usually worsens slowly. But its speed of progression varies, depending on a person's genetic makeup, environmental factors, age at diagnosis and other medical conditions. Still, anyone diagnosed with Alzheimer's whose symptoms seem to be progressing quickly â\x80\x94 or who experiences a sudden decline â\x80\x94 should see his or her doctor.",
    "Bell's palsy and Extreme tiredness and Extreme fatigue (2 causes) Bell's palsy and Extreme tiredness and Hepatitis (2 causes) Bell's palsy and Extreme tiredness and Liver pain (2 causes) Bell's palsy and Extreme tiredness and Lymph node swelling in children (2 causes)",
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)

### Determine the sparsity
query_sparsity = model.sparsity(query_embeddings)
document_sparsity = model.sparsity(document_embeddings)
print(query_sparsity, document_sparsity)

### {'active_dims': 28.0, 'sparsity_ratio': 0.9990826289233995} {'active_dims': 174.6666717529297, 'sparsity_ratio': 0.9942773516888497}
### Calculate the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)

### tensor([[11.3767, 10.8296,  4.3457]], device='cuda:0')
### Again with smaller max_active_dims
smaller_document_embeddings = model.encode_document(documents, max_active_dims=64)

### Determine the sparsity for the smaller document embeddings
smaller_document_sparsity = model.sparsity(smaller_document_embeddings)
print(query_sparsity, smaller_document_sparsity)

### {'active_dims': 28.0, 'sparsity_ratio': 0.9990826289233995} {'active_dims': 64.0, 'sparsity_ratio': 0.9979031518249132}
### Print the similarity scores for the smaller document embeddings
smaller_similarities = model.similarity(query_embeddings, smaller_document_embeddings)
print(smaller_similarities)

### tensor([[10.1311,  9.8360,  4.3457]], device='cuda:0')
### Very similar to the scores for the full document embeddings!
Are they any good?

A big question is: How do sparse embedding models stack up against the "standard" dense embedding models, and what kind of performance can you expect when combining the various approaches?

For this, I ran a variation of our hybrid_search.py evaluation script, which resulted in the following evaluation:

Dense   Sparse   Reranker   NDCG@10   MRR@10   MAP
  x                          65.33     57.56    57.97
           x                 67.34     59.59    59.98
  x        x                 72.39     66.99    67.59
  x                  x       68.37     62.76    63.56
           x         x       69.02     63.66    64.44
  x        x         x       68.28     62.66    63.44

Here, the sparse embedding model actually already outperforms the dense one, but the real magic happens when combining the two: hybrid search. In our case, we used Reciprocal Rank Fusion to merge the two rankings.
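
As an aside, Reciprocal Rank Fusion itself is simple to implement. Below is a minimal, illustrative sketch, not the exact code from hybrid_search.py; the k=60 constant and the toy document rankings are assumptions for this example.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is a list of document ids ordered best-first;
    # every document receives 1 / (k + rank) from each ranking it appears in.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["doc_a", "doc_b", "doc_c"]   # toy dense retrieval ranking
sparse_ranking = ["doc_a", "doc_d", "doc_b"]  # toy sparse retrieval ranking
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
# => ['doc_a', 'doc_b', 'doc_d', 'doc_c']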

Rerankers also help improve the performance of the dense or sparse model here, but hurt the performance of the hybrid search, as its performance is already beyond what the reranker can achieve.

[!NOTE]
The naver/splade-v3-doc was trained on the MS MARCO training set, so this is in-domain performance, much like what you might expect if you finetune on your own data.

Resources

Check out the following links to get a better feel for what Sparse Encoders are, how they work, what architectures exist, how to use them, what pretrained models exist, how to finetune them, and more:

Update Stats

The introduction of SparseEncoder has been one of the largest updates to Sentence Transformers.

New methods: encode_query and encode_document

Sentence Transformers v5.0 introduces two new core methods to the SentenceTransformer and SparseEncoder classes: encode_query and encode_document.

These methods are specialized versions of encode that differ in exactly two ways:

  1. If no prompt_name or prompt is provided, it uses a predefined “query”/“document” prompt,
    if available in the model’s prompts dictionary (example).
  2. It sets the task to “query”/“document”. If the model has a Router
    module, it will use the “query”/“document” task type to route the input through the appropriate submodules.

In short, if you use encode_query and encode_document, you can be sure that you're using the model's predefined prompts and the correct route (if the model has multiple routes).

If you are unsure whether you should use encode, encode_query, or encode_document,
your best bet is to use encode_query and encode_document for Information Retrieval tasks
with a clear query and document/passage distinction, and to use encode for all other tasks.

Note that encode is the most general method and can be used for any task, including Information
Retrieval, and that if the model was not trained with predefined prompts and/or task types, then all three methods will return identical embeddings.

See for example this snippet, which automatically uses the “query” prompt stored in the Qwen3-Embedding-0.6B model config.

from sentence_transformers import SentenceTransformer

### Load the model
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

### The queries and documents to embed
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

### Encode the queries and documents
query_embeddings = model.encode_query(queries)  # Equivalent to model.encode(queries, prompt_name="query")
document_embeddings = model.encode_document(documents)

### Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)

### tensor([[0.7646, 0.1414],
###         [0.1355, 0.6000]])

encode_multi_process absorbed by encode

The encode method (and by extension the encode_query and encode_document methods) can now be used directly for multi-processing/multi-GPU processing, instead of having to use encode_multi_process.

Previously, you had to manually start a multi-processing pool, use encode_multi_process, and stop the pool:

from sentence_transformers import SentenceTransformer

def main():
    model = SentenceTransformer("all-mpnet-base-v2")
    texts = ["The weather is so nice!", "It's so sunny outside.", ...]

    pool = model.start_multi_process_pool(["cpu", "cpu", "cpu", "cpu"])
    embeddings = model.encode_multi_process(texts, pool, chunk_size=512)
    model.stop_multi_process_pool(pool)

    print(embeddings.shape)

### => (4000, 768)

if __name__ == "__main__":
    main()

Now you can just pass a list of devices as device to encode:

from sentence_transformers import SentenceTransformer

def main():
    model = SentenceTransformer("all-mpnet-base-v2")
    texts = ["The weather is so nice!", "It's so sunny outside.", ...]

    embeddings = model.encode(texts, device=["cpu", "cpu", "cpu", "cpu"], chunk_size=512)

    print(embeddings.shape)

### => (4000, 768)

if __name__ == "__main__":
    main()

The multi-processing can be configured using these parameters:

  • device: If a list of devices, start multi-processing using those devices. Can be e.g. cpu, but also different GPUs.

  • pool: You can still use start_multi_process_pool and stop_multi_process_pool to create and stop a multi-processing pool, allowing you to reuse the pool across multiple encode calls via the pool argument (see the sketch after this list).

  • chunk_size: When you use multi-processing with n devices, then the inputs will be subdivided into chunks, and those chunks will be spread across the n processes. The size of the chunk can be defined here, although it’s optional. It can have a minor impact on processing speed and memory usage, but is much less important than the batch_size argument.

  • Documentation: Migration Guide

  • Documentation: SentenceTransformer.encode
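
For example, a pool can be created once and reused for several encode calls before being stopped. A hedged sketch follows; the device list, text counts, and chunk_size are illustrative.

from sentence_transformers import SentenceTransformer

def main():
    model = SentenceTransformer("all-mpnet-base-v2")

    # Create the pool once and reuse it for multiple encode calls
    pool = model.start_multi_process_pool(["cpu", "cpu", "cpu", "cpu"])
    first = model.encode(["The weather is so nice!"] * 2000, pool=pool, chunk_size=512)
    second = model.encode(["It's so sunny outside."] * 2000, pool=pool, chunk_size=512)
    model.stop_multi_process_pool(pool)

    print(first.shape, second.shape)
    # => (2000, 768) (2000, 768)

if __name__ == "__main__":
    main()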

Router module

The Sentence Transformers v5.0 release has refactored the Asym module into the Router module. The previous implementation wasn’t straightforward to use with the other components of the library. We’ve improved heavily on this to make the integration seamless. This module allows you to create asymmetric models that apply different modules depending on the specified route (often “query” or “document”).

Notably, you can use the task argument in model.encode to specify which route to use, and the model.encode_query and model.encode_document convenience methods automatically specify task="query" and task="document", respectively.

See for example opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill for an example of a model using a Router to specify different modules for queries vs documents. Its router_config.json specifies that the query route uses an efficient SparseStaticEmbedding module, while the document route uses the more expensive standard SPLADE modules: MLMTransformer with SpladePooling.

Usage is very straightforward with the new encode_query and encode_document methods:

from sentence_transformers import SparseEncoder

### Download from the 🤗 Hub
model = SparseEncoder("opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill")
print(model)

### SparseEncoder(
###   (0): Router(
###     (query_0_SparseStaticEmbedding): SparseStaticEmbedding({'frozen': True}, dim=30522, tokenizer=DistilBertTokenizerFast)
###     (document_0_MLMTransformer): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'DistilBertForMaskedLM'})
###     (document_1_SpladePooling): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30522})
###   )
### )
### Run inference
queries = ["what causes aging fast"]
documents = [
    "UV-A light, specifically, is what mainly causes tanning, skin aging, and cataracts, UV-B causes sunburn, skin aging and skin cancer, and UV-C is the strongest, and therefore most effective at killing microorganisms. Again â\x80\x93 single words and multiple bullets.",
    "Answers from Ronald Petersen, M.D. Yes, Alzheimer's disease usually worsens slowly. But its speed of progression varies, depending on a person's genetic makeup, environmental factors, age at diagnosis and other medical conditions. Still, anyone diagnosed with Alzheimer's whose symptoms seem to be progressing quickly â\x80\x94 or who experiences a sudden decline â\x80\x94 should see his or her doctor.",
    "Bell's palsy and Extreme tiredness and Extreme fatigue (2 causes) Bell's palsy and Extreme tiredness and Hepatitis (2 causes) Bell's palsy and Extreme tiredness and Liver pain (2 causes) Bell's palsy and Extreme tiredness and Lymph node swelling in children (2 causes)",
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)

### [1, 30522] [3, 30522]
### Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)

### tensor([[12.0820,  6.5648,  5.0988]])

Note that if you wish to train a model with a Router, you must specify the router_mapping training argument, which maps dataset column names to Router routes so that the Trainer knows which route to use for each dataset column.
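
For illustration, here is a hedged sketch of such a mapping, assuming a training dataset with "question" and "answer" columns; the column names and output directory are made up for this example.

from sentence_transformers import SentenceTransformerTrainingArguments

# Route "question" texts through the "query" route and "answer" texts
# through the "document" route of the Router during training.
args = SentenceTransformerTrainingArguments(
    output_dir="models/router-model",  # hypothetical output directory
    router_mapping={"question": "query", "answer": "document"},
)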

Note also that any models using Asym still work as before.

InputModule and Module modules

Alongside introducing some new modules and refactoring the Asym module into the Router module, we also introduced two new "superclass" modules: Module and InputModule. The former is the new base class of all modules, with the latter as the base class of all modules that are also responsible for tokenization (i.e. for processing inputs).

The documentation describes which methods still need to be implemented when you subclass one of these, and also which convenience methods are available for you to use already. It should certainly simplify the creation of custom modules.
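
As a rough sketch of the idea only: the module below just rescales the pooled sentence embedding and is purely illustrative; it assumes Module can be imported from sentence_transformers.models, and the documentation remains the reference for which abstract methods and save/load hooks a real custom module must implement.

import torch
from sentence_transformers.models import Module

class ScaleEmbeddings(Module):
    # Illustrative custom module that rescales the pooled sentence embedding
    def __init__(self, scale: float = 1.0):
        super().__init__()
        self.scale = scale

    def forward(self, features: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        features["sentence_embedding"] = features["sentence_embedding"] * self.scale
        return features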

Custom Learning Rates for parameter groups

With the introduction of the Router module, it’s becoming much simpler to train a “two-tower model” where the query and document encoders differ a lot. For example, a regular Sentence Transformer for the document encoder, and a Static Embedding model for the query encoder.

In such settings, it's worthwhile to set different learning rates for different parts of the model. Because of this, v5.0 adds a learning_rate_mapping parameter to the Training Arguments classes. This mapping pairs parameter-name regular expressions with learning rates, e.g.

args = SentenceTransformerTrainingArguments(
    ...,
    learning_rate=2e-5,
    learning_rate_mapping={"StaticEmbedding.*": 1e-3},
)

Using these training arguments, the learning rate for every parameter whose name matches the regular expression is 1e-3, while all other parameters have a learning rate of 2e-5. Note that we use re.search for determining whether a parameter matches the regular expression, not match or fullmatch.
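
To make the matching semantics concrete, here is a small standalone check; the parameter name is hypothetical.

import re

pattern = "StaticEmbedding.*"
param_name = "router.query_0_StaticEmbedding.embedding.weight"  # hypothetical parameter name

print(bool(re.search(pattern, param_name)))     # True: matches anywhere in the name
print(bool(re.match(pattern, param_name)))      # False: would need to match from the start
print(bool(re.fullmatch(pattern, param_name)))  # False: would need to match the entire name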

Training with composite losses

Many models are trained with just one loss, or perhaps one loss for each dataset. In those cases, all of the losses are nicely logged in both the terminal and third party logging tools (e.g. Weights & Biases, Tensorboard, etc.).

But if you’re using one loss that has multiple components, e.g. a SpladeLoss which sums the losses from FlopsLoss and a SparseMultipleNegativesRankingLoss behind the scenes, then you’re often left guessing whether the various loss components are balanced or not: perhaps one of the two is responsible for 90% of the total loss?

As of the v5.0 release, your loss classes can output dictionaries of loss components. The Trainer will sum them and train like normal, but each of the components will also be logged individually! In short, you can see the various loss components in addition to the final loss itself in your logs.

class SpladeLoss(nn.Module):
    ...

    def forward(
        self, sentence_features: Iterable[dict[str, torch.Tensor]], labels: torch.Tensor | None = None
    ) -> dict[str, torch.Tensor]:
        # Compute embeddings using the model
        embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]

        ...

        return {
            "base_loss": base_loss,
            "document_regularizer_loss": corpus_loss * self.document_regularizer_weight,
            "query_regularizer_loss": query_loss * self.query_regularizer_weight,
        }


Small improvements

  • Allow training with custom batch samplers and multi-dataset batch samplers (#​3162)
  • Gradient Checkpointing was fixed for CrossEncoder models (#​3331)
  • Add sif_coefficient, token_remove_pattern, and quantize_to parameters from Model2Vec to StaticEmbedding.from_distillation(...) (#​3349)
  • Added examples for semantic search using OpenSearch and Sentence Transformers (#​3369)
  • Added caching support to mine_hard_negatives (#​3338)
  • Add prompts support to mine_hard_negatives (#​3334)
  • You can now pass truncate_dim to encode (and encode_query, encode_document) instead of exclusively being able to set the truncate_dim when initializing the SentenceTransformer.
  • You can now access the underlying transformers model with model.transformers_model; this works for SentenceTransformer, CrossEncoder, and SparseEncoder (see the sketch below).
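
A brief hedged sketch of the last two items; the model and the 256-dimensional truncation are only examples.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

# Truncate the 768-dimensional embeddings at encode time instead of at initialization
# (truncation is most meaningful for models trained with Matryoshka-style losses)
embeddings = model.encode(["The weather is lovely today."], truncate_dim=256)
print(embeddings.shape)
# => (1, 256)

# Access the underlying Hugging Face transformers model directly
print(type(model.transformers_model))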

See our Migration Guide for more details on the changes, as well as the documentation as a whole.

All Changes

New Contributors

Thanks

I especially want to thank the following teams and individuals for their contributions to this release, small and large, in no particular order:

  • Amazon OpenSearch, for being receptive to an integration and working together on the documentation and blogpost
  • NAVER, for being receptive to an integration of your excellent SPLADE models
  • Qdrant, for assisting with semantic search of sparse embeddings
  • Prithivi Da, for being receptive to an integration of your excellent Apache 2.0 SPLADE models
  • CSR authors, for working with us to integrate your architecture and open sourcing your models with an integration
  • Elastic, for assisting with semantic search of sparse embeddings
  • IBM, for being receptive to an integration of your Sparse model

Apologies if I forgot anyone.
And finally a big thanks to Arthur Bresnu, who led a lot of the work on this release. I wouldn't have been able to introduce Sparse Encoders in this fashion, in this timeline, without his excellent work.

Full Changelog: huggingface/sentence-transformers@v4.1.0...v5.0.0

v4.1.0: - ONNX and OpenVINO backends offering 2-3x speedups; improved hard negatives mining

Compare Source

This release introduces two new efficient computing backends for CrossEncoder (reranker) models, ONNX and OpenVINO, along with optimization & quantization support, allowing for speedups of 2x-3x. It also improves the hard negatives mining strategies and includes minor improvements.

Install this version with

### Training + Inference
pip install sentence-transformers[train]==4.1.0

### Inference only, use one of:
pip install sentence-transformers==4.1.0
pip install sentence-transformers[onnx-gpu]==4.1.0
pip install sentence-transformers[onnx]==4.1.0
pip install sentence-transformers[openvino]==4.1.0

Faster ONNX and OpenVINO Backends for CrossEncoder (#​3319)

This release introduces a new backend keyword argument to the CrossEncoder initialization, accepting the values "torch" (default), "onnx", and "openvino".
These require installing sentence-transformers with specific extras:

pip install sentence-transformers[onnx-gpu]

### or ONNX for CPU only:
pip install sentence-transformers[onnx]

### or
pip install sentence-transformers[openvino]

It's as simple as:

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")

query = "Which planet is known as the Red Planet?"
passages = [
   "Venus is often called Earth's twin because of its similar size and proximity.",
   "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
   "Jupiter, the largest planet in our solar system, has a prominent red spot.",
   "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

scores = model.predict([(query, passage) for passage in passages])
print(scores)

If you specify a backend and your model repository or directory contains an ONNX/OpenVINO model file, it will automatically be used! And if your model repository or directory doesn't have one already, an ONNX/OpenVINO model will be automatically exported. Just remember to call model.push_to_hub or model.save_pretrained with the same model repository or directory to avoid having to re-export the model every time.
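
For example, here is a hedged sketch of exporting once and saving locally so that later loads reuse the exported file; the local path is illustrative.

from sentence_transformers import CrossEncoder

# The first load exports an ONNX model automatically if the repository doesn't have one
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")

# Save it (or model.push_to_hub(...)) so the export doesn't happen again next time
model.save_pretrained("local/ms-marco-MiniLM-L6-v2-onnx")

# Subsequent loads from that directory pick up the exported ONNX file directly
model = CrossEncoder("local/ms-marco-MiniLM-L6-v2-onnx", backend="onnx")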

All keyword arguments passed via model_kwargs will be passed on to ORTModelForSequenceClassification.from_pretrained or OVModelForSequenceClassification.from_pretrained. The most useful arguments are:

  • provider: (Only if backend="onnx") ONNX Runtime provider to use for loading the model, e.g. "CPUExecutionProvider". See https://onnxruntime.ai/docs/execution-providers/ for possible providers. If not specified, the strongest available provider (e.g. "CUDAExecutionProvider") will be used.
  • file_name: The name of the ONNX file to load. If not specified, will default to "model.onnx" or otherwise "onnx/model.onnx" for ONNX, and "openvino_model.xml" and "openvino/openvino_model.xml" for OpenVINO. This argument is useful for specifying optimized or quantized models.
  • export: A boolean flag specifying whether the model will be exported. If not provided, export will be set to True if the model repository or directory does not already contain an ONNX or OpenVINO model.

For example:

from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={
        "file_name": "model_O3.onnx",
        "provider": "CPUExecutionProvider",
    },
)

query = "Which planet is known as the Red Planet?"
passages = [
   "Venus is often called Earth's twin because of its similar size and proximity.",
   "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
   "Jupiter, the largest planet in our solar system, has a prominent red spot.",
   "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

scores = model.predict([(query, passage) for passage in passages])
print(scores)

Benchmarks

We ran benchmarks for CPU and GPU, averaging findings across 4 models of various sizes, 3 datasets, and numerous batch sizes. These findings resulted in the recommendations below.

For GPU, you can expect 1.88x speedup with fp16 at no cost, and for CPU you can expect ~3x speedup at no cost of accuracy in our evaluation. Your mileage with the accuracy hit for quantization may vary, but it seems to remain very small.

Read the Speeding up Inference documentation for more details.

ONNX & OpenVINO Optimization and Quantization

In addition to exporting default ONNX and OpenVINO models, you can also use one of the helper methods for optimizing and quantizing ONNX models:

ONNX Optimization

export_optimized_onnx_model: This function uses Optimum to implement several optimizations in the ONNX model, ranging from basic optimizations to approximations and mixed precision. Read about the 4 default options here. This function accepts:

  • model: A SentenceTransformer or CrossEncoder model loaded with backend="onnx".
  • optimization_config: "O1", "O2", "O3", or "O4" from 🤗 Optimum or a custom OptimizationConfig instance.
  • model_name_or_path: The directory or model repository where the optimized model will be saved.
  • push_to_hub: Whether to push the exported model to the hub with model_name_or_path as the repository name. If False, the model will be saved in the directory specified with model_name_or_path.
  • create_pr: If push_to_hub, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for optimizing models of repositories that you don't have write access to.
  • file_suffix: The suffix to add to the optimized model file name. Will use the optimization_config string or "optimized" if not set.

The usage is like this:

from sentence_transformers import CrossEncoder, export_optimized_onnx_model

onnx_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
export_optimized_onnx_model(
    model=onnx_model,
    optimization_config="O4",
    model_name_or_path="cross-encoder/ms-marco-MiniLM-L6-v2",
    push_to_hub=True,
    create_pr=True,
)

After which you can load the model with:

from sentence_transformers import CrossEncoder

pull_request_nr = 2 # TODO: Update this to the number of your pull request
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_O4.onnx"},
    revision=f"refs/pr/{pull_request_nr}"
)

or when it gets merged:

from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_O4.onnx"},
)

ONNX Quantization

export_dynamic_quantized_onnx_model: This function uses Optimum to quantize the ONNX model to int8, also allowing for hardware-specific optimizations. This results in impressive speedups for CPUs. In my findings, each of the default quantization configuration options gave approximately the same performance improvements. This function accepts:

  • model: A SentenceTransformer or CrossEncoder model loaded with backend="onnx".
  • quantization_config: "arm64", "avx2", "avx512", or "avx512_vnni", representing quantization configurations from AutoQuantizationConfig, or a QuantizationConfig instance.
  • model_name_or_path: The directory or model repository where the optimized model will be saved.
  • push_to_hub: Whether to push the exported model to the hub with model_name_or_path as the repository name. If False, the model will be saved in the directory specified with model_name_or_path.
  • create_pr: If push_to_hub, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for quantizing models of repositories that you don't have write access to.
  • file_suffix: The suffix to add to the optimized model file name. Will use the quantization_config string or e.g. "int8_quantized" if not set.

The usage is like this:

from sentence_transformers import CrossEncoder, export_dynamic_quantized_onnx_model

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
export_dynamic_quantized_onnx_model(
    model,
    "avx512_vnni",
    "sentence-transformers/cross-encoder/ms-marco-MiniLM-L6-v2",
    push_to_hub=True,
    create_pr=True,
)

After which you can load the model with:

from sentence_transformers import CrossEncoder

pull_request_nr = 2 # TODO: Update this to the number of your pull request
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_qint8_avx512_vnni.onnx"},
    revision=f"refs/pr/{pull_request_nr}",
)

or when it gets merged:

from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_qint8_avx512_vnni.onnx"},
)

OpenVINO Quantization

OpenVINO models can be quantized to int8 precision using Optimum Intel to speed up inference. To do this, you can use the export_static_quantized_openvino_model() function, which saves the quantized model in a directory or model repository that you specify. Post-Training Static Quantization expects:

  • model: a Sentence Transformer or Cross Encoder model loaded with the OpenVINO backend.
  • quantization_config: (Optional) The quantization configuration. This parameter accepts either: None for the default 8-bit quantization, a dictionary representing quantization configurations, or an OVQuantizationConfig instance.
  • model_name_or_path: a path to save the quantized model file, or the repository name if you want to push it to the Hugging Face Hub.
  • dataset_name: (Optional) The name of the dataset to load for calibration. If not specified, defaults to sst2 subset from the glue dataset.
  • dataset_config_name: (Optional) The specific configuration of the dataset to load.
  • dataset_split: (Optional) The split of the dataset to load (e.g., ‘train’, ‘test’).
  • column_name: (Optional) The column name in the dataset to use for calibration.
  • push_to_hub: (Optional) a boolean to push the quantized model to the Hugging Face Hub.
  • create_pr: (Optional) a boolean to create a pull request when pushing to the Hugging Face Hub. Useful when you don’t have write access to the repository.
  • file_suffix: (Optional) a string to append to the model name when saving it. If not specified, "qint8_quantized" will be used.

The usage is like this:

from sentence_transformers import CrossEncoder, export_static_quantized_openvino_model

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="openvino")
export_static_quantized_openvino_model(
    model,
    quantization_config=None,
    model_name_or_path="cross-encoder/ms-marco-MiniLM-L6-v2",
    push_to_hub=True,
    create_pr=True,
)

After which you can load the model with:

from sentence_transformers import CrossEncoder

pull_request_nr = 2 # TODO: Update this to the number of your pull request
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="openvino",
    model_kwargs={"file_name": "openvino/openvino_model_qint8_quantized.xml"},
    revision=f"refs/pr/{pull_request_nr}"
)

or when it gets merged:

from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="openvino",
    model_kwargs={"file_name": "openvino/openvino_model_qint8_quantized.xml"},
)

Read the Speeding up Inference documentation for more details.

Relative Margin in Hard Negatives Mining (#​3321)

This PR softly deprecates the margin option in mine_hard_negatives in favor of absolute_margin and relative_margin. In short:

  • absolute_margin: Discards negative candidates whose anchor_negative_similarity score is greater than or equal to anchor_positive_similarity - absolute_margin. With an absolute_margin of 0.1 and an anchor-positive similarity of 0.86, the maximum anchor-negative similarity for that anchor (e.g. query) is 0.76.
  • relative_margin: Discards negative candidates whose anchor_negative_similarity score is greater than or equal to anchor_positive_similarity * (1 - relative_margin). With a relative_margin of 0.05 and an anchor-positive similarity of 0.86, the maximum anchor-negative similarity for that anchor (e.g. query) is 0.817 (i.e. 95% of the anchor-positive similarity).

This means that we now support the recommended hard negatives mining strategy from the excellent NV-Retriever paper, a.k.a. the TopK-PercPos (95%) strategy:

from sentence_transformers.util import mine_hard_negatives

...

dataset = mine_hard_negatives(
    dataset=dataset,
    model=model,
    relative_margin=0.05,         # 0.05 means that the negative is at most 95% as similar to the anchor as the positive
    num_negatives=num_negatives,  # 10 or less is recommended
    sampling_strategy="top",      # "top" means that we sample the top candidates as negatives
    batch_size=batch_size,        # Adjust as needed
    use_faiss=True,               # Optional: Use faiss/faiss-gpu for faster similarity search
)

Minor Changes

  • Add margin and margin_strategy to GISTEmbedLoss and CachedGISTEmbedLoss (#​3299, #​3323)
  • Support activation_function=None in Dense module (#​3316)
  • Update how all_layer_embeddings outputs are determined (#​3320)
  • Avoid error with SentenceTransformer.encode if prompts are provided and output_value=None (#​3327)

All Changes

New Contributors

Full Changelog: huggingface/sentence-transformers@v4.0.2...v4.1.0

v4.0.2: - Safer reranker max sequence length logic, typing issues, FSDP & device placement

Compare Source

This patch release updates some logic for maximum sequence lengths, typing issues, FSDP training, and distributed training device placement.

Install this version with

### Training + Inference
pip install sentence-transformers[train]==4.0.2

### Inference only, use one of:
pip install sentence-transformers==4.0.2
pip install sentence-transformers[onnx-gpu]==4.0.2
pip install sentence-transformers[onnx]==4.0.2
pip install sentence-transformers[openvino]==4.0.2

Safer CrossEncoder (reranker) maximum sequence length

When loading CrossEncoder models, we now rely on the minimum of the tokenizer model_max_length and the config max_position_embeddings (if they exist), rather than only relying on the latter if it exists. This previously resulted in the maximum sequence length of BAAI/bge-reranker-base being 514, whereas it can only handle sequences up to 512 tokens.
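
Conceptually, the new behavior is equivalent to the following hedged sketch (not the actual implementation); the existing example below then shows the result on the real model.

from typing import Optional

def resolve_max_length(tokenizer_model_max_length: Optional[int],
                       config_max_position_embeddings: Optional[int]) -> Optional[int]:
    # Take the minimum of the limits that are actually defined
    limits = [limit for limit in (tokenizer_model_max_length, config_max_position_embeddings) if limit is not None]
    return min(limits) if limits else None

# BAAI/bge-reranker-base: the tokenizer reports 512, the config reports 514 -> 512
print(resolve_max_length(512, 514))
# => 512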

from sentence_transformers import CrossEncoder

model = CrossEncoder("BAAI/bge-reranker-base")
print(model.max_length)

### => 512
### The texts for which to predict similarity scores
query = "How many 


---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Never, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR was generated by [Mend Renovate](https://mend.io/renovate/). View the [repository job log](https://developer.mend.io/github/genlayerlabs/genvm).


coderabbitai bot commented Jul 1, 2025

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



@kp2pml30 kp2pml30 force-pushed the main branch 12 times, most recently from 42edd69 to 2e08fa7 on July 18, 2025 at 15:38
@kp2pml30 kp2pml30 force-pushed the main branch 16 times, most recently from 53d5a78 to 86b7cce on September 12, 2025 at 11:33
