chore(deps): update dependency sentence-transformers to v5 #228
This PR contains the following updates:
sentence-transformers: `^3.2.1` -> `^5.0.0`

Release Notes
UKPLab/sentence-transformers (sentence-transformers)
v5.0.0 - SparseEncoder support; encode_query & encode_document; multi-processing in encode; Router; and more (Compare Source)
This release consists of significant updates, including the introduction of Sparse Encoder models, the new `encode_query` and `encode_document` methods, multi-processing support in `encode`, the `Router` module for asymmetric models, custom learning rates for parameter groups, composite loss logging, and various small improvements and bug fixes.

Install this version with pip, e.g. `pip install sentence-transformers==5.0.0`.
Sparse Encoder models
The Sentence Transformers v5.0 release introduces Sparse Embedding models, also known as Sparse Encoders. These models generate high-dimensional embeddings, often with 30,000+ dimensions, of which typically fewer than 1% are non-zero. This is in contrast to the standard dense embedding models, which produce low-dimensional embeddings (e.g. 384, 768, or 1024 dimensions) where all values are non-zero.

Usually, each active dimension (i.e. each dimension with a non-zero value) in a sparse embedding corresponds to a specific token in the model's vocabulary, allowing for interpretability. This means that you can, for example, see exactly which words/tokens are important in an embedding, and inspect exactly which words/tokens cause two texts to be deemed similar.
Let's have a look at naver/splade-v3, a strong sparse embedding model, as an example:
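A minimal sketch of what this looks like; the example sentences are illustrative, and the exact `decode` signature (here with a `top_k` argument) is an assumption based on the description below:

```python
from sentence_transformers import SparseEncoder

# Load a pretrained sparse embedding model
model = SparseEncoder("naver/splade-v3")

sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # e.g. (2, 30522): one dimension per vocabulary token

# Pairwise similarity between the sparse embeddings
print(model.similarity(embeddings, embeddings))

# Inspect the highest-weighted tokens per embedding (assumed decode signature)
decoded = model.decode(embeddings, top_k=10)
print(decoded[0])
```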
In this example, the embeddings are 30,522-dimensional vectors, where each dimension corresponds to a token in the model's vocabulary. The `decode` method returns the top 10 tokens with the highest values in the embedding, allowing us to interpret which tokens contribute most to the embedding.

We can even determine the intersection or overlap between embeddings, which is very useful for determining why two texts are deemed similar or dissimilar:
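One way to inspect that overlap, sketched here with the same assumed `decode` helper (not necessarily the original snippet), is to decode both embeddings and intersect the resulting token sets:

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-v3")

query_embedding = model.encode("What causes rain to form?")
document_embedding = model.encode("Rain forms when water vapor condenses into droplets.")

# Collect the highest-weighted tokens of each embedding (assumed decode signature)
query_tokens = {token for token, weight in model.decode(query_embedding, top_k=100)}
document_tokens = {token for token, weight in model.decode(document_embedding, top_k=100)}

# Tokens active in both embeddings hint at why the texts are deemed similar
print(query_tokens & document_tokens)
```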
And if we think the embeddings are too big, we can limit the maximum number of active dimensions like so:
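A sketch of that, assuming a `max_active_dims` argument at initialization (the value 128 is illustrative):

```python
from sentence_transformers import SparseEncoder

# Assumption: max_active_dims caps how many dimensions may stay non-zero per embedding
model = SparseEncoder("naver/splade-v3", max_active_dims=128)

embeddings = model.encode(["The weather is lovely today.", "It's so sunny outside!"])
print(model.similarity(embeddings, embeddings))
```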
Limiting the number of active dimensions like this has minimal impact on the similarity scores.
Are they any good?
A big question is: How do sparse embedding models stack up against the “standard” dense embedding models, and what kind of performance can you expect when combining the two?
For this, I ran a variation of our hybrid_search.py evaluation script, comparing dense-only, sparse-only, and hybrid retrieval, with and without a reranker.

In this evaluation, the sparse embedding model actually already outperforms the dense one, but the real magic happens when combining the two: hybrid search. In our case, we used Reciprocal Rank Fusion to merge the two rankings.
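For reference, a minimal sketch of Reciprocal Rank Fusion over two rankings (the constant k=60 is a common default, not something prescribed by this release):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into a single ranking.

    Each document scores sum(1 / (k + rank)) over the rankings it appears in.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. merge a dense and a sparse ranking for one query
dense_ranking = ["doc3", "doc1", "doc7"]
sparse_ranking = ["doc1", "doc9", "doc3"]
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
```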
Rerankers also help improve the performance of the dense or sparse model here, but hurt the performance of the hybrid search, as its performance is already beyond what the reranker can achieve.
Resources
Check out the Sparse Encoder documentation to get a better feel for what Sparse Encoders are, how they work, what architectures exist, how to use them, what pretrained models exist, how to finetune them, and more.
Update Stats
The introduction of SparseEncoder has been one of the largest updates to Sentence Transformers, spanning new model classes, modules, losses, evaluators, a trainer, and documentation.
New methods: encode_query and encode_document

Sentence Transformers v5.0 introduces two new core methods to the `SentenceTransformer` and `SparseEncoder` classes: `encode_query` and `encode_document`. These methods are specialized versions of `encode` that differ in exactly two ways:

- If no `prompt_name` or `prompt` is provided, they use a predefined “query”/“document” prompt, if available in the model's `prompts` dictionary (example).
- They set the `task` to “query”/“document”. If the model has a `Router` module, it will use the “query”/“document” task type to route the input through the appropriate submodules.

In short, if you use `encode_query` and `encode_document`, you can be sure that you're using the model's predefined prompts and the correct route (if the model has multiple routes). If you are unsure whether you should use `encode`, `encode_query`, or `encode_document`, your best bet is to use `encode_query` and `encode_document` for Information Retrieval tasks with a clear query and document/passage distinction, and `encode` for all other tasks.

Note that `encode` is the most general method and can be used for any task, including Information Retrieval, and that if the model was not trained with predefined prompts and/or task types, then all three methods will return identical embeddings.
See for example this snippet, which automatically uses the “query” prompt stored in the Qwen3-Embedding-0.6B model config.
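A minimal sketch of that pattern, assuming Qwen/Qwen3-Embedding-0.6B is the full repository id; without stored prompts or routes, all three methods would behave identically:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# encode_query applies the model's stored "query" prompt automatically;
# encode_document does the same for the document side (if such a prompt exists).
query_embeddings = model.encode_query(["Which planet is known as the Red Planet?"])
document_embeddings = model.encode_document([
    "Mars, often called the Red Planet, owes its color to iron oxide on its surface.",
])

print(model.similarity(query_embeddings, document_embeddings))
```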
encode_multi_process absorbed by encode

The `encode` method (and by extension the `encode_query` and `encode_document` methods) can now be used directly for multi-processing/multi-GPU processing, instead of having to use `encode_multi_process`. Previously, you had to manually start a multi-processing pool, call `encode_multi_process`, and stop the pool. Now you can just pass a list of devices as `device` to `encode`, as in the sketch below.
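A minimal sketch, assuming two CUDA GPUs and using all-MiniLM-L6-v2 as an example model (any list of devices, including several CPU workers, works the same way):

```python
from sentence_transformers import SentenceTransformer

if __name__ == "__main__":  # guard required because child processes are spawned
    model = SentenceTransformer("all-MiniLM-L6-v2")
    sentences = [f"This is sentence number {i}." for i in range(100_000)]

    # Passing a list of devices makes encode manage the multi-processing pool internally
    embeddings = model.encode(sentences, device=["cuda:0", "cuda:1"], batch_size=64)
    print(embeddings.shape)
```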
devicetoencode:The multi-processing can be configured using these parameters:
device: If a list of devices, start multi-processing using those devices. Can be e.g. cpu, but also different GPUs.pool: You can still usestart_multi_process_poolandstop_multi_process_poolto create and stop a multi-processing pool, allowing you to reuse the pool across multipleencodecalls via thepoolarguments.chunk_size: When you use multi-processing with n devices, then the inputs will be subdivided into chunks, and those chunks will be spread across the n processes. The size of the chunk can be defined here, although it’s optional. It can have a minor impact on processing speed and memory usage, but is much less important than thebatch_sizeargument.Documentation: Migration Guide
Documentation: SentenceTransformer.encode
Router module
The Sentence Transformers v5.0 release has refactored the `Asym` module into the `Router` module. The previous implementation wasn't straightforward to use with the other components of the library; we've improved heavily on this to make the integration seamless. This module allows you to create asymmetric models that apply different modules depending on the specified route (often “query” or “document”).

Notably, you can use the `task` argument in `model.encode` to specify which route to use, and the `model.encode_query` and `model.encode_document` convenience methods automatically specify `task="query"` and `task="document"`, respectively.

See for example opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill for a model using a `Router` to specify different modules for queries vs. documents. Its router_config.json specifies that the query route uses an efficient `SparseStaticEmbedding` module, while the document route uses the more expensive standard SPLADE modules: `MLMTransformer` with `SpladePooling`. Usage is very straightforward with the new `encode_query` and `encode_document` methods, as in the sketch below.
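A minimal sketch of that usage, loading the model as a SparseEncoder (the query and document texts are illustrative):

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder("opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill")

# encode_query routes through the lightweight SparseStaticEmbedding module,
# encode_document through the MLMTransformer + SpladePooling modules.
query_embeddings = model.encode_query(["What's the weather in New York right now?"])
document_embeddings = model.encode_document(["Currently, New York is rainy with a high of 12°C."])

print(model.similarity(query_embeddings, document_embeddings))
```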
Note that if you wish to train a model with a `Router`, you must specify the `router_mapping` training argument, which maps dataset column names to `Router` routes, so that the Trainer knows which route to use for each dataset column. Note also that any models using `Asym` still work as before.

InputModule and Module modules
Alongside introducing some new modules and refactoring the `Asym` module into the `Router` module, we also introduced two new "superclass" modules: `Module` and `InputModule`. The former is the new base class of all modules, and the latter is the base class of all modules that are also responsible for tokenization (i.e. for processing inputs).

The documentation describes which methods still need to be implemented when you subclass one of these, as well as which convenience methods are already available to you. It should certainly simplify the creation of custom modules.
Custom Learning Rates for parameter groups
With the introduction of the `Router` module, it's becoming much simpler to train a “two-tower model” where the query and document encoders differ a lot: for example, a regular Sentence Transformer for the document encoder and a Static Embedding model for the query encoder.

In such settings, it's worthwhile to set different learning rates for different parts of the model. Because of this, v5.0 adds a `learning_rate_mapping` parameter to the Training Arguments classes. This mapping maps parameter-name regular expressions to learning rates, as in the sketch below.
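A sketch of such a mapping; the output directory and the regular expression are illustrative, not taken from the original notes:

```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="models/two-tower-example",
    learning_rate=2e-5,  # default learning rate for all remaining parameters
    learning_rate_mapping={
        # parameters whose names match this regular expression get a higher learning rate
        r"SparseStaticEmbedding": 1e-3,
    },
)
```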
Using these training arguments, the learning rate for every parameter whose name matches the regular expression is 1e-3, while all other parameters have a learning rate of 2e-5. Note that we use `re.search` for determining whether a parameter matches the regular expression, not `re.match` or `re.fullmatch`.

Training with composite losses
Many models are trained with just one loss, or perhaps one loss for each dataset. In those cases, all of the losses are nicely logged in both the terminal and third party logging tools (e.g. Weights & Biases, Tensorboard, etc.).
But if you’re using one loss that has multiple components, e.g. a SpladeLoss which sums the losses from FlopsLoss and a SparseMultipleNegativesRankingLoss behind the scenes, then you’re often left guessing whether the various loss components are balanced or not: perhaps one of the two is responsible for 90% of the total loss?
As of the v5.0 release, your loss classes can output dictionaries of loss components. The Trainer will sum them and train like normal, but each of the components will also be logged individually! In short, you can see the various loss components in addition to the final loss itself in your logs.
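As an illustration, here is a hypothetical custom loss that reports its components as a dictionary; the component names and weighting are made up, but the dictionary return is the mechanism described above:

```python
from torch import nn

class CompositeLoss(nn.Module):
    """Hypothetical loss combining a ranking loss with a weighted regularization term."""

    def __init__(self, model, ranking_loss: nn.Module, regularizer: nn.Module, reg_weight: float = 0.1):
        super().__init__()
        self.model = model
        self.ranking_loss = ranking_loss
        self.regularizer = regularizer
        self.reg_weight = reg_weight

    def forward(self, sentence_features, labels):
        # Returning a dict: the Trainer sums the values for the backward pass,
        # and logs each component ("ranking", "regularization") individually.
        return {
            "ranking": self.ranking_loss(sentence_features, labels),
            "regularization": self.reg_weight * self.regularizer(sentence_features, labels),
        }
```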
Small improvements
- Support for the `sif_coefficient`, `token_remove_pattern`, and `quantize_to` parameters from Model2Vec in `StaticEmbedding.from_distillation(...)` (#3349)
- `mine_hard_negatives` improvements (#3338)
- `mine_hard_negatives` improvements (#3334)
- You can now pass `truncate_dim` to `encode` (and `encode_query`, `encode_document`) instead of exclusively being able to set the `truncate_dim` when initializing the `SentenceTransformer`.
- You can access the underlying `transformers` model with `model.transformers_model`; this works for `SentenceTransformer`, `CrossEncoder`, and `SparseEncoder`.

See our Migration Guide for more details on the changes, as well as the documentation as a whole.
All Changes
- [docs] Point to v4.1 new docs pages in index.html by @tomaarsen in #3328
- [ci] Attempt to avoid 429 Client Error in CI by @tomaarsen in #3342
- [fix, cross-encoder] Propagate the gradient checkpointing to the transformer model by @tomaarsen in #3331
- [tests] Update test based on M2V version by @tomaarsen in #3354
- [docs] Add two useful recommendations to the docs by @tomaarsen in #3353
- [refactor] Refactor module loading; introduce Module subclass by @tomaarsen in #3345
- [tests] Improve robustness of model shape assertion in model2vec test by @tomaarsen in #3391
- [fix] Use transformers Peft integration instead of manual get_peft_model call by @tomaarsen in #3405
- [v5] Add support for Sparse Embedding models by @arthurbr11 in #3401
- [docs] Fix formatting of docstring arguments in SpladeRegularizerWeightSchedulerCallback by @tomaarsen in #3408
- [fix] Update .gitignore by @arthurbr11 in #3409
- [fix] Remove hub_kwargs in SparseStaticEmbedding.from_json in favor of more explicit kwargs by @tomaarsen in #3407
- [docs] Update collections links by @arthurbr11 in #3410

New Contributors
Thanks
I especially want to thank the following teams and individuals for their contributions to this release, small and large, in no particular order:
Apologies if I forgot anyone.
And finally a big thanks to Arthur Bresnu, who led a lot of the work on this release. I wouldn't have been able to introduce Sparse Encoders in this fashion, in this timeline, without his excellent work.
Full Changelog: huggingface/sentence-transformers@v4.1.0...v5.0.0
v4.1.0 - ONNX and OpenVINO backends offering 2-3x speedups; improved hard negatives mining (Compare Source)
This release introduces two new efficient computing backends for CrossEncoder (reranker) models, ONNX and OpenVINO, along with optimization and quantization options, allowing for speedups of up to 2x-3x; improved hard negatives mining strategies; and minor improvements.

Install this version with pip, e.g. `pip install sentence-transformers==4.1.0`.
Faster ONNX and OpenVINO Backends for CrossEncoder (#3319)
Introducing a new `backend` keyword argument to the `CrossEncoder` initialization, allowing values of `"torch"` (default), `"onnx"`, and `"openvino"`. These require installing `sentence-transformers` with specific extras. Using a non-default backend is then as simple as this:
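A minimal sketch, using a common public reranker as the example model (installing the matching extra, e.g. `sentence-transformers[onnx]`, is assumed):

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", backend="onnx")

scores = model.predict([
    ("How many people live in Berlin?", "Berlin has a population of roughly 3.7 million people."),
    ("How many people live in Berlin?", "The capital of France is Paris."),
])
print(scores)
```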
If you specify a `backend` and your model repository or directory contains an ONNX/OpenVINO model file, it will automatically be used! And if your model repository or directory doesn't have one already, an ONNX/OpenVINO model will be automatically exported. Just remember to `model.push_to_hub` or `model.save_pretrained` into the same model repository or directory to avoid having to re-export the model every time.

All keyword arguments passed via `model_kwargs` will be passed on to `ORTModelForSequenceClassification.from_pretrained` or `OVModelForSequenceClassification.from_pretrained`. The most useful arguments are:

- `provider`: (Only if `backend="onnx"`) ONNX Runtime provider to use for loading the model, e.g. `"CPUExecutionProvider"`. See https://onnxruntime.ai/docs/execution-providers/ for possible providers. If not specified, the strongest provider (e.g. `"CUDAExecutionProvider"`) will be used.
- `file_name`: The name of the ONNX file to load. If not specified, will default to "model.onnx" or otherwise "onnx/model.onnx" for ONNX, and "openvino_model.xml" or "openvino/openvino_model.xml" for OpenVINO. This argument is useful for specifying optimized or quantized models.
- `export`: A boolean flag specifying whether the model will be exported. If not provided, `export` will be set to True if the model repository or directory does not already contain an ONNX or OpenVINO model.

For example:
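A hedged sketch that loads a specific ONNX file and pins the execution provider; the file name is illustrative:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L-6-v2",
    backend="onnx",
    model_kwargs={
        "file_name": "onnx/model_O3.onnx",   # illustrative: an optimized ONNX file
        "provider": "CPUExecutionProvider",  # only relevant for backend="onnx"
    },
)
```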
Benchmarks
We ran benchmarks for CPU and GPU, averaging findings across 4 models of various sizes, 3 datasets, and numerous batch sizes.

For GPU, you can expect a 1.88x speedup with fp16 at no cost, and for CPU you can expect a ~3x speedup at no cost of accuracy in our evaluation. Your mileage with the accuracy hit for quantization may vary, but it seems to remain very small.
Read the Speeding up Inference documentation for more details.
ONNX & OpenVINO Optimization and Quantization
In addition to exporting default ONNX and OpenVINO models, you can also use one of the helper methods for optimizing and quantizing ONNX models:
ONNX Optimization
`export_optimized_onnx_model`: This function uses Optimum to implement several optimizations in the ONNX model, ranging from basic optimizations to approximations and mixed precision. Read about the 4 default options here. This function accepts:

- `model`: A SentenceTransformer or CrossEncoder model loaded with `backend="onnx"`.
- `optimization_config`: "O1", "O2", "O3", or "O4" from 🤗 Optimum or a custom `OptimizationConfig` instance.
- `model_name_or_path`: The directory or model repository where the optimized model will be saved.
- `push_to_hub`: Whether to push the exported model to the hub with `model_name_or_path` as the repository name. If False, the model will be saved in the directory specified with `model_name_or_path`.
- `create_pr`: If `push_to_hub`, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for optimizing models of repositories that you don't have write access to.
- `file_suffix`: The suffix to add to the optimized model file name. Will use the `optimization_config` string or "optimized" if not set.

The usage is like this:
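A hedged sketch of the export step; the repository id is an example, and `create_pr=True` proposes the optimized file as a pull request:

```python
from sentence_transformers import CrossEncoder
from sentence_transformers.backend import export_optimized_onnx_model

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", backend="onnx")

export_optimized_onnx_model(
    model,
    optimization_config="O3",
    model_name_or_path="cross-encoder/ms-marco-MiniLM-L-6-v2",
    push_to_hub=True,
    create_pr=True,  # propose the optimized file via a pull request instead of pushing directly
)
```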
After which you can load the model, either while the pull request is still open (via its `refs/pr/...` revision) or once it gets merged:
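A sketch of both loading variants; the pull request number and file name are illustrative:

```python
from sentence_transformers import CrossEncoder

# While the pull request with the optimized file is still open:
pull_request_nr = 2  # illustrative
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L-6-v2",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_O3.onnx"},
    revision=f"refs/pr/{pull_request_nr}",
)

# Once the pull request is merged, the revision is no longer needed:
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L-6-v2",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_O3.onnx"},
)
```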
ONNX Quantization
`export_dynamic_quantized_onnx_model`: This function uses Optimum to quantize the ONNX model to int8, also allowing for hardware-specific optimizations. This results in impressive speedups for CPUs. In my findings, each of the default quantization configuration options gave approximately the same performance improvements. This function accepts:

- `model`: A SentenceTransformer or CrossEncoder model loaded with `backend="onnx"`.
- `quantization_config`: "arm64", "avx2", "avx512", or "avx512_vnni" representing quantization configurations from `AutoQuantizationConfig`, or a `QuantizationConfig` instance.
- `model_name_or_path`: The directory or model repository where the quantized model will be saved.
- `push_to_hub`: Whether to push the exported model to the hub with `model_name_or_path` as the repository name. If False, the model will be saved in the directory specified with `model_name_or_path`.
- `create_pr`: If `push_to_hub`, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for quantizing models of repositories that you don't have write access to.
- `file_suffix`: The suffix to add to the quantized model file name. Will use the `quantization_config` string or e.g. "int8_quantized" if not set.

The usage is like this:
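A hedged sketch along the same lines, using "avx512_vnni" as an example configuration:

```python
from sentence_transformers import CrossEncoder
from sentence_transformers.backend import export_dynamic_quantized_onnx_model

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", backend="onnx")

export_dynamic_quantized_onnx_model(
    model,
    quantization_config="avx512_vnni",
    model_name_or_path="cross-encoder/ms-marco-MiniLM-L-6-v2",
    push_to_hub=True,
    create_pr=True,
)
```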
After which you can load the model in the same way as the optimized ONNX model above: point `file_name` in `model_kwargs` at the quantized file, using the `refs/pr/...` revision while the pull request is still open, or no revision once it gets merged.
OpenVINO Quantization
OpenVINO models can be quantized to int8 precision using Optimum Intel to speed up inference. To do this, you can use the export_static_quantized_openvino_model() function, which saves the quantized model in a directory or model repository that you specify. Post-Training Static Quantization expects:
- `model`: a Sentence Transformer or Cross Encoder model loaded with the OpenVINO backend.
- `quantization_config`: (Optional) The quantization configuration. This parameter accepts either None for the default 8-bit quantization, a dictionary representing quantization configurations, or an `OVQuantizationConfig` instance.
- `model_name_or_path`: a path to save the quantized model file, or the repository name if you want to push it to the Hugging Face Hub.
- `dataset_name`: (Optional) The name of the dataset to load for calibration. If not specified, defaults to the sst2 subset of the glue dataset.
- `dataset_config_name`: (Optional) The specific configuration of the dataset to load.
- `dataset_split`: (Optional) The split of the dataset to load (e.g. 'train', 'test').
- `column_name`: (Optional) The column name in the dataset to use for calibration.
- `push_to_hub`: (Optional) a boolean to push the quantized model to the Hugging Face Hub.
- `create_pr`: (Optional) a boolean to create a pull request when pushing to the Hugging Face Hub. Useful when you don't have write access to the repository.
- `file_suffix`: (Optional) a string to append to the model name when saving it. If not specified, "qint8_quantized" will be used.

The usage is like this:
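A hedged sketch using the default int8 quantization and the default calibration dataset; the repository id is an example:

```python
from sentence_transformers import CrossEncoder
from sentence_transformers.backend import export_static_quantized_openvino_model

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", backend="openvino")

export_static_quantized_openvino_model(
    model,
    quantization_config=None,  # None -> default 8-bit quantization
    model_name_or_path="cross-encoder/ms-marco-MiniLM-L-6-v2",
    push_to_hub=True,
    create_pr=True,
)
```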
After which you can load the model analogously: point `file_name` in `model_kwargs` at the quantized OpenVINO file, using the `refs/pr/...` revision while the pull request is still open, or no revision once it gets merged.
Read the Speeding up Inference documentation for more details.
Relative Margin in Hard Negatives Mining (#3321)
This PR softly deprecates the `margin` option in `mine_hard_negatives` in favor of `absolute_margin` and `relative_margin`. In short:

- `absolute_margin`: Discards negative candidates whose `anchor_negative_similarity` score is greater than or equal to `anchor_positive_similarity - absolute_margin`. With an `absolute_margin` of 0.1 and an anchor-positive similarity of 0.86, the maximum anchor-negative similarity for that anchor (e.g. query) is 0.76.
- `relative_margin`: Discards negative candidates whose `anchor_negative_similarity` score is greater than or equal to `anchor_positive_similarity * (1 - relative_margin)`. With a `relative_margin` of 0.05 and an anchor-positive similarity of 0.86, the maximum anchor-negative similarity for that anchor (e.g. query) is 0.817 (i.e. 95% of the anchor-positive similarity).

This means that we now support the recommended hard negatives mining strategy from the excellent NV-Retriever paper, a.k.a. the TopK-PercPos (95%) strategy:
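A hedged sketch of mining with the new relative margin; the dataset, model, and number of negatives are illustrative:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

model = SentenceTransformer("all-MiniLM-L6-v2")

# Illustrative (query, answer) dataset to mine negatives for
dataset = load_dataset("sentence-transformers/natural-questions", split="train[:1000]")

hard_dataset = mine_hard_negatives(
    dataset,
    model,
    relative_margin=0.05,  # TopK-PercPos (95%): negatives must score below 95% of the positive
    num_negatives=5,
)
print(hard_dataset)
```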
Minor Changes
- Added `margin` and `margin_strategy` to GISTEmbedLoss and CachedGISTEmbedLoss (#3299, #3323)
- Allow `activation_function=None` in the Dense module (#3316)
- Improved how `all_layer_embeddings` outputs are determined (#3320)
- Avoid an error in `SentenceTransformer.encode` if `prompts` are provided and `output_value=None` (#3327)

All Changes
- [docs] Update a removed article with a new source by @lakshminarasimmanv in https://github.com/UKPLab/sentence-transformers/pull/3309
- [typing] Fix typing for CrossEncoder.to by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/3324
- [feat] hard neg mining: deprecate margin in favor of absolute_margin & relative margin by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/3321
- [fix] Use return_dict=True in Transformer; improve how all_layer_embeddings are determined by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/3320
- [fix] Avoid error if prompts & output_value=None by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/3327
- [backend] Add ONNX & OpenVINO support for Cross Encoder (reranker) models by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/3319

New Contributors
Full Changelog: huggingface/sentence-transformers@v4.0.2...v4.1.0
v4.0.2 - Safer reranker max sequence length logic, typing issues, FSDP & device placement (Compare Source)
This patch release updates some logic for maximum sequence lengths, typing issues, FSDP training, and distributed training device placement.
Install this version with pip, e.g. `pip install sentence-transformers==4.0.2`.
Safer CrossEncoder (reranker) maximum sequence length
When loading `CrossEncoder` models, we now rely on the minimum of the tokenizer's `model_max_length` and the config's `max_position_embeddings` (if they exist), rather than only relying on the latter if it exists. Previously, this resulted in the maximum sequence length of BAAI/bge-reranker-base being 514, whereas it can only handle sequences up to 512 tokens.