From 908e9f2bf2ad4557b2bccc1474212e11c18ada70 Mon Sep 17 00:00:00 2001
From: Parag Ekbote
Date: Thu, 27 Mar 2025 18:58:50 +0000
Subject: [PATCH 1/4] Modify Model Card for ModernBERT.

---
 docs/source/en/model_doc/modernbert.md | 71 +++++++++++++++++++++-----
 1 file changed, 59 insertions(+), 12 deletions(-)

diff --git a/docs/source/en/model_doc/modernbert.md b/docs/source/en/model_doc/modernbert.md
index f7ceaae18797..b3cbbc7a9a03 100644
--- a/docs/source/en/model_doc/modernbert.md
+++ b/docs/source/en/model_doc/modernbert.md
@@ -14,21 +14,15 @@ rendered properly in your Markdown viewer.
 
 -->
 
-# ModernBERT
-
 PyTorch
 FlashAttention
 SDPA
-## Overview
-
-The ModernBERT model was proposed in [Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference](https://arxiv.org/abs/2412.13663) by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard and Iacopo Poli.
-
-It is a refresh of the traditional encoder architecture, as used in previous models such as [BERT](https://huggingface.co/docs/transformers/en/model_doc/bert) and [RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta).
+# ModernBERT
 
-It builds on BERT and implements many modern architectural improvements which have been developed since its original release, such as:
+[ModernBERT](https://huggingface.co/papers/2412.13663) is an improved successor model built on the traditional encoder architecture. It builds on BERT and implements many modern architectural improvements which have been developed since its original release, such as:
 
 - [Rotary Positional Embeddings](https://huggingface.co/blog/designing-positional-encoding) to support sequences of up to 8192 tokens.
 - [Unpadding](https://arxiv.org/abs/2208.08124) to ensure no compute is wasted on padding tokens, speeding up processing time for batches with mixed-length sequences.
 - [GeGLU](https://arxiv.org/abs/2002.05202) Replacing the original MLP layers with GeGLU layers, shown to improve performance.
@@ -37,12 +31,65 @@ It builds on BERT and implements many modern architectural improvements which ha
 - A model designed following recent [The Case for Co-Designing Model Architectures with Hardware](https://arxiv.org/abs/2401.14489), ensuring maximum efficiency across inference GPUs.
 - Modern training data scales (2 trillion tokens) and mixtures (including code and math data)
 
-The abstract from the paper is the following:
-
-*Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.*
-
 The original code can be found [here](https://github.com/answerdotai/modernbert).
 
+The example below demonstrates how to predict the `[MASK]` token with [`Pipeline`], [`AutoModel`], and from the command line.
+
+
+
+
+```py
+import torch
+from transformers import pipeline
+
+pipeline = pipeline(
+    task="fill-mask",
+    model="answerdotai/ModernBERT-base",
+    torch_dtype=torch.float16,
+    device=0
+)
+pipeline("Plants create [MASK] through a process known as photosynthesis.")
+```
+
+
+
+
+```py
+import torch
+from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained(
+    "answerdotai/ModernBERT-base",
+)
+model = AutoModelForMaskedLM.from_pretrained(
+    "answerdotai/ModernBERT-base",
+    torch_dtype=torch.float16,
+    device_map="auto",
+    attn_implementation="sdpa"
+)
+inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt").to("cuda")
+
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = outputs.logits
+
+masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
+predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
+predicted_token = tokenizer.decode(predicted_token_id)
+
+print(f"The predicted token is: {predicted_token}")
+```
+
+
+
+
+```bash
+echo -e "Plants create [MASK] through a process known as photosynthesis." | transformers-cli run --task fill-mask --model answerdotai/ModernBERT-base --device 0
+```
+
+
+
+
 ## Resources
 
 A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ModernBert.

From 4c908e3c0bd4278a86db7ca2233a4bb1dfc07da8 Mon Sep 17 00:00:00 2001
From: Parag Ekbote
Date: Sat, 29 Mar 2025 10:25:40 +0530
Subject: [PATCH 2/4] Update as per code review.

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/modernbert.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/modernbert.md b/docs/source/en/model_doc/modernbert.md
index f8693a1b5e82..1c88cb4fbd51 100644
--- a/docs/source/en/model_doc/modernbert.md
+++ b/docs/source/en/model_doc/modernbert.md
@@ -22,7 +22,12 @@ rendered properly in your Markdown viewer.
 # ModernBERT
 
-[ModernBERT](https://huggingface.co/papers/2412.13663) is an improved successor model built on the traditional encoder architecture. It builds on BERT and implements many modern architectural improvements which have been developed since its original release, such as:
+[ModernBERT](https://huggingface.co/papers/2412.13663) is a modernized version of [`BERT`] trained on 2T tokens. It brings many improvements to the original architecture such as rotary positional embeddings to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention.
+
+You can find all the original ModernBERT checkpoints under the [ModernBERT](https://huggingface.co/collections/answerdotai/modernbert-67627ad707a4acbf33c41deb) collection.
+
+> [!TIP]
+> Click on the ModernBERT models in the right sidebar for more examples of how to apply ModernBERT to different language tasks.
 
 - [Rotary Positional Embeddings](https://huggingface.co/blog/designing-positional-encoding) to support sequences of up to 8192 tokens.
 - [Unpadding](https://arxiv.org/abs/2208.08124) to ensure no compute is wasted on padding tokens, speeding up processing time for batches with mixed-length sequences.
 - [GeGLU](https://arxiv.org/abs/2002.05202) Replacing the original MLP layers with GeGLU layers, shown to improve performance.
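The GeGLU bullet kept as context above describes the change in a single line. As a rough sketch of the idea, a GELU-gated linear unit standing in for a plain MLP block, the following self-contained PyTorch snippet may help. The hidden and intermediate sizes are placeholder values chosen for illustration, and the class itself is hypothetical rather than part of the model's code.

```py
import torch
import torch.nn as nn

class GeGLUFeedForward(nn.Module):
    """Minimal GeGLU feed-forward sketch: one fused projection produces a gate
    and a value, the GELU of the gate multiplies the value, and the result is
    projected back to the hidden size. Sizes are illustrative placeholders."""

    def __init__(self, hidden_size=768, intermediate_size=1152):
        super().__init__()
        self.wi = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
        self.act = nn.GELU()
        self.wo = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, hidden_states):
        gate, value = self.wi(hidden_states).chunk(2, dim=-1)
        return self.wo(self.act(gate) * value)

x = torch.randn(2, 16, 768)         # (batch, sequence, hidden)
print(GeGLUFeedForward()(x).shape)  # torch.Size([2, 16, 768])
```

Relative to a standard two-layer MLP, the fused projection adds an input-dependent gate, which is the ingredient the linked GLU-variants paper associates with the quality improvement.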
From c8336e785cb1a4944e6c2f085ac0c6a6108be1a4 Mon Sep 17 00:00:00 2001
From: Parag Ekbote
Date: Sat, 29 Mar 2025 06:26:34 +0000
Subject: [PATCH 3/4] Update model card.

---
 docs/source/en/model_doc/modernbert.md | 31 +++++---------------------
 1 file changed, 6 insertions(+), 25 deletions(-)

diff --git a/docs/source/en/model_doc/modernbert.md b/docs/source/en/model_doc/modernbert.md
index 1c88cb4fbd51..33e62b46e967 100644
--- a/docs/source/en/model_doc/modernbert.md
+++ b/docs/source/en/model_doc/modernbert.md
@@ -14,10 +14,12 @@ rendered properly in your Markdown viewer.
 
 -->
 
-
-PyTorch
-FlashAttention
-SDPA
+
+
+ PyTorch
+ FlashAttention
+ SDPA
+
 
 # ModernBERT
@@ -95,27 +97,6 @@ echo -e "Plants create [MASK] through a process known as photosynthesis." | tran
 
 
 
-## Resources
-
-A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ModernBert.
-
-
-
-- A notebook on how to [finetune for General Language Understanding Evaluation (GLUE) with Transformers](https://github.com/AnswerDotAI/ModernBERT/blob/main/examples/finetune_modernbert_on_glue.ipynb), also available as a Google Colab [notebook](https://colab.research.google.com/github/AnswerDotAI/ModernBERT/blob/main/examples/finetune_modernbert_on_glue.ipynb). 🌎
-
-
-
-- A script on how to [finetune for text similarity or information retrieval with Sentence Transformers](https://github.com/AnswerDotAI/ModernBERT/blob/main/examples/train_st.py). 🌎
-- A script on how to [finetune for information retrieval with PyLate](https://github.com/AnswerDotAI/ModernBERT/blob/main/examples/train_pylate.py). 🌎
-
-
-
-- [Masked language modeling task guide](../tasks/masked_language_modeling)
-
-
-
-- [`ModernBertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) and [colab notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb).
-
 ## ModernBertConfig
 
 [[autodoc]] ModernBertConfig

From b52504eda7ae43dc1b9884d04881f54ecb5cec59 Mon Sep 17 00:00:00 2001
From: Parag Ekbote
Date: Thu, 3 Apr 2025 15:38:56 +0000
Subject: [PATCH 4/4] Update model card.

---
 docs/source/en/model_doc/modernbert.md | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/docs/source/en/model_doc/modernbert.md b/docs/source/en/model_doc/modernbert.md
index 33e62b46e967..16ada230a2ae 100644
--- a/docs/source/en/model_doc/modernbert.md
+++ b/docs/source/en/model_doc/modernbert.md
@@ -30,15 +30,6 @@ You can find all the original ModernBERT checkpoints under the [ModernBERT](http
 
 > [!TIP]
 > Click on the ModernBERT models in the right sidebar for more examples of how to apply ModernBERT to different language tasks.
 
-- [Rotary Positional Embeddings](https://huggingface.co/blog/designing-positional-encoding) to support sequences of up to 8192 tokens.
-- [Unpadding](https://arxiv.org/abs/2208.08124) to ensure no compute is wasted on padding tokens, speeding up processing time for batches with mixed-length sequences.
-- [GeGLU](https://arxiv.org/abs/2002.05202) Replacing the original MLP layers with GeGLU layers, shown to improve performance.
-- [Alternating Attention](https://arxiv.org/abs/2004.05150v2) where most attention layers employ a sliding window of 128 tokens, with Global Attention only used every 3 layers.
-- [Flash Attention](https://github.com/Dao-AILab/flash-attention) to speed up processing.
-- A model designed following recent [The Case for Co-Designing Model Architectures with Hardware](https://arxiv.org/abs/2401.14489), ensuring maximum efficiency across inference GPUs.
-- Modern training data scales (2 trillion tokens) and mixtures (including code and math data)
-
-The original code can be found [here](https://github.com/answerdotai/modernbert).
 
 The example below demonstrates how to predict the `[MASK]` token with [`Pipeline`], [`AutoModel`], and from the command line.
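The alternating attention pattern summarized in the removed bullets above, a 128-token sliding window on most layers with global attention on every third layer, can be pictured with a small mask sketch. The snippet below is purely illustrative: the layer count, the half-window convention, and the helper name are assumptions made for the example, not ModernBERT's internal implementation.

```py
import torch

def alternating_attention_masks(seq_len, window=128, global_every=3, num_layers=6):
    """Build one boolean attention mask per layer: full (global) attention on
    every `global_every`-th layer, a sliding local window elsewhere.
    True marks a key position a query is allowed to attend to."""
    positions = torch.arange(seq_len)
    # Keep |query - key| inside half the window on local layers.
    local = (positions[None, :] - positions[:, None]).abs() < window // 2
    full = torch.ones(seq_len, seq_len, dtype=torch.bool)
    return [full if layer % global_every == 0 else local for layer in range(num_layers)]

for i, mask in enumerate(alternating_attention_masks(seq_len=512)):
    kind = "global" if mask.all() else "local"
    print(f"layer {i}: {kind}, {mask.sum().item()} allowed query-key pairs")
```

Boolean masks of this shape can be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention` to compare how much work a local layer does versus a global one.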