From 64dc0385305b1351ce43245622606a80702680ad Mon Sep 17 00:00:00 2001 From: end_me Date: Sat, 26 Jul 2025 14:53:09 -0700 Subject: [PATCH 1/8] standardized barthez model card according to template --- docs/source/en/model_doc/barthez.md | 96 +++++++++++++++++++++-------- 1 file changed, 70 insertions(+), 26 deletions(-) diff --git a/docs/source/en/model_doc/barthez.md b/docs/source/en/model_doc/barthez.md index 0f8568cc05ec..9641d98a71b0 100644 --- a/docs/source/en/model_doc/barthez.md +++ b/docs/source/en/model_doc/barthez.md @@ -14,49 +14,93 @@ rendered properly in your Markdown viewer. --> +
+
+ PyTorch + TensorFlow + Flax +
+
+ # BARThez -
-PyTorch -TensorFlow -Flax -
+[BARThez](https://huggingface.co/papers/2010.12321) is the first monolingual French [BART](https://huggingface.co/papers/1910.13461) model, pretrained on a large, adapted French corpus using BART’s denoising objectives. By pretraining both its encoder and decoder, it is uniquely equipped for generative tasks in French, unlike existing French [BERT](https://huggingface.co/papers/1810.04805)‑based models. + +You can find all of the original BARThez checkpoints under the [BARThez](https://huggingface.co/collections/dascim/barthez-670920b569a07aa53e3b6887) collection. + +> [!TIP] +> This model was contributed by [moussakam](https://huggingface.co/moussakam). +> Click on the BARThez models in the right sidebar for more examples of how to apply BARThez to different language tasks. + + +The example below demonstrates how to predict the `[MASK]` token with [`Pipeline`], [`AutoModel`], and from the command line. + + + + +```py +import torch +from transformers import pipeline + +pipeline = pipeline( + task="fill-mask", + model="moussaKam/barthez", + torch_dtype=torch.float16, + device=0 +) +pipeline("Les plantes produisent [MASK] grâce à un processus appelé photosynthèse.") +``` + + + -## Overview +```py +import torch +from transformers import AutoModelForMaskedLM, AutoTokenizer -The BARThez model was proposed in [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://huggingface.co/papers/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis on 23 Oct, -2020. 
+tokenizer = AutoTokenizer.from_pretrained( + "moussaKam/barthez", +) +model = AutoModelForMaskedLM.from_pretrained( + "moussaKam/barthez", + torch_dtype=torch.float16, + device_map="auto", + attn_implementation="sdpa" +) +inputs = tokenizer("Les plantes produisent [MASK] grâce à un processus appelé photosynthèse.", return_tensors="pt").to("cuda") -The abstract of the paper: +with torch.no_grad(): + outputs = model(**inputs) + predictions = outputs.logits +masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1] +predicted_token_id = predictions[0, masked_index].argmax(dim=-1) +predicted_token = tokenizer.decode(predicted_token_id) -*Inductive transfer learning, enabled by self-supervised learning, have taken the entire Natural Language Processing -(NLP) field by storm, with models such as BERT and BART setting new state of the art on countless natural language -understanding tasks. While there are some notable exceptions, most of the available models and research have been -conducted for the English language. In this work, we introduce BARThez, the first BART model for the French language -(to the best of our knowledge). BARThez was pretrained on a very large monolingual French corpus from past research -that we adapted to suit BART's perturbation schemes. Unlike already existing BERT-based French language models such as -CamemBERT and FlauBERT, BARThez is particularly well-suited for generative tasks, since not only its encoder but also -its decoder is pretrained. In addition to discriminative tasks from the FLUE benchmark, we evaluate BARThez on a novel -summarization dataset, OrangeSum, that we release with this paper. 
We also continue the pretraining of an already -pretrained multilingual BART on BARThez's corpus, and we show that the resulting model, which we call mBARTHez, -provides a significant boost over vanilla BARThez, and is on par with or outperforms CamemBERT and FlauBERT.* +print(f"The predicted token is: {predicted_token}") +``` -This model was contributed by [moussakam](https://huggingface.co/moussakam). The Authors' code can be found [here](https://github.com/moussaKam/BARThez). + + - +```bash +echo -e "Les plantes produisent [MASK] grâce à un processus appelé photosynthèse." | transformers run --task fill-mask --model moussaKam/barthez --device 0 +``` -BARThez implementation is the same as BART, except for tokenization. Refer to [BART documentation](bart) for information on -configuration classes and their parameters. BARThez-specific tokenizers are documented below. + + - +## Notes +- BARThez implementation is the same as BART, except for tokenization. Refer to [BART documentation](bart) for information on +configuration classes and their parameters. ## Resources - BARThez can be fine-tuned on sequence-to-sequence tasks in a similar way as BART, check: [examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization/README.md). +- The authors' original github repository can be found [here](https://github.com/moussaKam/BARThez). 
## BarthezTokenizer From 58fb0080530252581012ebc6a2057609b8bba6b5 Mon Sep 17 00:00:00 2001 From: Ethan Villarosa <113210015+EthanV431@users.noreply.github.com> Date: Tue, 29 Jul 2025 14:19:27 -0700 Subject: [PATCH 2/8] Update docs/source/en/model_doc/barthez.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/barthez.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/barthez.md b/docs/source/en/model_doc/barthez.md index 9641d98a71b0..7bdfcdc5de13 100644 --- a/docs/source/en/model_doc/barthez.md +++ b/docs/source/en/model_doc/barthez.md @@ -25,7 +25,7 @@ rendered properly in your Markdown viewer. # BARThez -[BARThez](https://huggingface.co/papers/2010.12321) is the first monolingual French [BART](https://huggingface.co/papers/1910.13461) model, pretrained on a large, adapted French corpus using BART’s denoising objectives. By pretraining both its encoder and decoder, it is uniquely equipped for generative tasks in French, unlike existing French [BERT](https://huggingface.co/papers/1810.04805)‑based models. +[BARThez](https://huggingface.co/papers/2010.12321) is a [BART](./bart) model designed for French language tasks. Unlike existing French BERT models, BARThez includes a pretrained encoder-decoder, allowing it to generate text as well. This model is also available as a multilingual variant, mBARThez, created by continuing the pretraining of multilingual BART on a French corpus. You can find all of the original BARThez checkpoints under the [BARThez](https://huggingface.co/collections/dascim/barthez-670920b569a07aa53e3b6887) collection. 
From a20e2e165ca5e54671a5770e6e1273c9b5811178 Mon Sep 17 00:00:00 2001 From: Ethan Villarosa <113210015+EthanV431@users.noreply.github.com> Date: Tue, 29 Jul 2025 14:19:35 -0700 Subject: [PATCH 3/8] Update docs/source/en/model_doc/barthez.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/barthez.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/barthez.md b/docs/source/en/model_doc/barthez.md index 7bdfcdc5de13..279db2701b3b 100644 --- a/docs/source/en/model_doc/barthez.md +++ b/docs/source/en/model_doc/barthez.md @@ -49,7 +49,7 @@ pipeline = pipeline( torch_dtype=torch.float16, device=0 ) -pipeline("Les plantes produisent [MASK] grâce à un processus appelé photosynthèse.") +pipeline("Les plantes produisent <mask> grâce à un processus appelé photosynthèse.") ``` From b77052bbca9754b5f5fffd847a3a57556be76f72 Mon Sep 17 00:00:00 2001 From: Ethan Villarosa <113210015+EthanV431@users.noreply.github.com> Date: Tue, 29 Jul 2025 14:19:41 -0700 Subject: [PATCH 4/8] Update docs/source/en/model_doc/barthez.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/barthez.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/barthez.md b/docs/source/en/model_doc/barthez.md index 279db2701b3b..ba363cfb8dfb 100644 --- a/docs/source/en/model_doc/barthez.md +++ b/docs/source/en/model_doc/barthez.md @@ -68,7 +68,7 @@ model = AutoModelForMaskedLM.from_pretrained( device_map="auto", attn_implementation="sdpa" ) -inputs = tokenizer("Les plantes produisent [MASK] grâce à un processus appelé photosynthèse.", return_tensors="pt").to("cuda") +inputs = tokenizer("Les plantes produisent <mask> grâce à un processus appelé photosynthèse.", return_tensors="pt").to("cuda") with torch.no_grad(): outputs = model(**inputs) From 77881087b3b11533c4eadc08a7e7073df3503ade Mon Sep 17 00:00:00 2001 From: Ethan Villarosa 
<113210015+EthanV431@users.noreply.github.com> Date: Tue, 29 Jul 2025 14:19:49 -0700 Subject: [PATCH 5/8] Update docs/source/en/model_doc/barthez.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/barthez.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/source/en/model_doc/barthez.md b/docs/source/en/model_doc/barthez.md index ba363cfb8dfb..2d76810a57f9 100644 --- a/docs/source/en/model_doc/barthez.md +++ b/docs/source/en/model_doc/barthez.md @@ -66,7 +66,6 @@ model = AutoModelForMaskedLM.from_pretrained( "moussaKam/barthez", torch_dtype=torch.float16, device_map="auto", - attn_implementation="sdpa" ) inputs = tokenizer("Les plantes produisent <mask> grâce à un processus appelé photosynthèse.", return_tensors="pt").to("cuda") From 37e41b53daf002c7b992decdf27c198ff59b13ec Mon Sep 17 00:00:00 2001 From: Ethan Villarosa <113210015+EthanV431@users.noreply.github.com> Date: Tue, 29 Jul 2025 14:19:56 -0700 Subject: [PATCH 6/8] Update docs/source/en/model_doc/barthez.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/barthez.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/barthez.md b/docs/source/en/model_doc/barthez.md index 2d76810a57f9..5a4e46bd94f4 100644 --- a/docs/source/en/model_doc/barthez.md +++ b/docs/source/en/model_doc/barthez.md @@ -84,7 +84,7 @@ ```bash -echo -e "Les plantes produisent [MASK] grâce à un processus appelé photosynthèse." +echo -e "Les plantes produisent <mask> grâce à un processus appelé photosynthèse." 
| transformers run --task fill-mask --model moussaKam/barthez --device 0 ``` From 71cddee4928736037d462d979f3368d6ca668793 Mon Sep 17 00:00:00 2001 From: Ethan Villarosa <113210015+EthanV431@users.noreply.github.com> Date: Tue, 29 Jul 2025 14:20:04 -0700 Subject: [PATCH 7/8] Update docs/source/en/model_doc/barthez.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/barthez.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/barthez.md b/docs/source/en/model_doc/barthez.md index 5a4e46bd94f4..92803dd9ef04 100644 --- a/docs/source/en/model_doc/barthez.md +++ b/docs/source/en/model_doc/barthez.md @@ -31,7 +31,7 @@ You can find all of the original BARThez checkpoints under the [BARThez](https:/ > [!TIP] > This model was contributed by [moussakam](https://huggingface.co/moussakam). -> Click on the BARThez models in the right sidebar for more examples of how to apply BARThez to different language tasks. +> Refer to the [BART](./bart) docs for more usage examples. The example below demonstrates how to predict the `[MASK]` token with [`Pipeline`], [`AutoModel`], and from the command line. From 882885ec7c5f9d8e71fc7509398168d4eb05f101 Mon Sep 17 00:00:00 2001 From: end_me Date: Tue, 29 Jul 2025 14:27:25 -0700 Subject: [PATCH 8/8] suggested changes to barthez model card --- docs/source/en/model_doc/barthez.md | 13 +------------ 1 file changed, 1 insertion(+), 12 deletions(-) diff --git a/docs/source/en/model_doc/barthez.md b/docs/source/en/model_doc/barthez.md index 92803dd9ef04..fdaf28c8d7d4 100644 --- a/docs/source/en/model_doc/barthez.md +++ b/docs/source/en/model_doc/barthez.md @@ -34,7 +34,7 @@ You can find all of the original BARThez checkpoints under the [BARThez](https:/ > Refer to the [BART](./bart) docs for more usage examples. -The example below demonstrates how to predict the `[MASK]` token with [`Pipeline`], [`AutoModel`], and from the command line. 
+The example below demonstrates how to predict the `<mask>` token with [`Pipeline`], [`AutoModel`], and from the command line. @@ -90,17 +90,6 @@ echo -e "Les plantes produisent <mask> grâce à un processus appelé photosynth -## Notes -- BARThez implementation is the same as BART, except for tokenization. Refer to [BART documentation](bart) for information on -configuration classes and their parameters. - -## Resources - -- BARThez can be fine-tuned on sequence-to-sequence tasks in a similar way as BART, check: - [examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization/README.md). - -- The authors' original github repository can be found [here](https://github.com/moussaKam/BARThez). - ## BarthezTokenizer [[autodoc]] BarthezTokenizer
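For quick reference alongside the patches above: the card's `AutoModel` example locates the mask position in the input ids and takes an argmax over that row of the logits. The same lookup can be sketched in pure Python without loading the model; every id and logit value below is a hypothetical stand-in, not the real BARThez vocabulary, and `predict_masked_token` is an illustrative helper, not a Transformers API.

```python
# Sketch of the fill-mask decoding step from the model card's AutoModel example:
# find the mask token's position, then argmax over that position's logits row.
# torch.where(...) and tensor .argmax(dim=-1) play these roles in the card.

MASK_TOKEN_ID = 4  # hypothetical; tokenizer.mask_token_id in the real example


def predict_masked_token(input_ids, logits):
    """Return (mask_position, predicted_token_id) for the first mask in input_ids."""
    mask_position = input_ids.index(MASK_TOKEN_ID)  # position of the mask token
    row = logits[mask_position]                     # scores over the (toy) vocabulary
    predicted_token_id = max(range(len(row)), key=row.__getitem__)  # argmax
    return mask_position, predicted_token_id


# Toy stand-in for "Les plantes produisent <mask> ..." as made-up token ids.
ids = [0, 17, 52, MASK_TOKEN_ID, 8, 2]
# One made-up logits row per position (toy vocab size 6); the mask row peaks at id 5.
logits = [[0.0] * 6 for _ in ids]
logits[3][5] = 9.0

print(predict_masked_token(ids, logits))  # -> (3, 5)
```

The real pipeline adds tokenization and decoding around this core, but the mask-position lookup and per-position argmax are the same idea.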