From 2792fa344b9ad7ca62a0066b0216b22085ba1d47 Mon Sep 17 00:00:00 2001 From: Wun0 Date: Fri, 28 Mar 2025 00:14:54 -0400 Subject: [PATCH 01/12] Update ELECTRA model card with new format --- docs/source/en/model_doc/electra.md | 144 +++++++++++++++++----------- 1 file changed, 87 insertions(+), 57 deletions(-) diff --git a/docs/source/en/model_doc/electra.md b/docs/source/en/model_doc/electra.md index bee883d64153..57dd38fa5857 100644 --- a/docs/source/en/model_doc/electra.md +++ b/docs/source/en/model_doc/electra.md @@ -14,66 +14,96 @@ rendered properly in your Markdown viewer. --> -# ELECTRA - -
-PyTorch -TensorFlow -Flax +
+ PyTorch + TensorFlow + Flax +
-## Overview - -The ELECTRA model was proposed in the paper [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than -Generators](https://openreview.net/pdf?id=r1xMH1BtvB). ELECTRA is a new pretraining approach which trains two -transformer models: the generator and the discriminator. The generator's role is to replace tokens in a sequence, and -is therefore trained as a masked language model. The discriminator, which is the model we're interested in, tries to -identify which tokens were replaced by the generator in the sequence. - -The abstract from the paper is the following: - -*Masked language modeling (MLM) pretraining methods such as BERT corrupt the input by replacing some tokens with [MASK] -and then train a model to reconstruct the original tokens. While they produce good results when transferred to -downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a -more sample-efficient pretraining task called replaced token detection. Instead of masking the input, our approach -corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead -of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that -predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments -demonstrate this new pretraining task is more efficient than MLM because the task is defined over all input tokens -rather than just the small subset that was masked out. As a result, the contextual representations learned by our -approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are -particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained -using 30x more compute) on the GLUE natural language understanding benchmark. 
Our approach also works well at scale, -where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when -using the same amount of compute.* - -This model was contributed by [lysandre](https://huggingface.co/lysandre). The original code can be found [here](https://github.com/google-research/electra). - -## Usage tips - -- ELECTRA is the pretraining approach, therefore there is nearly no changes done to the underlying model: BERT. The - only change is the separation of the embedding size and the hidden size: the embedding size is generally smaller, - while the hidden size is larger. An additional projection layer (linear) is used to project the embeddings from their - embedding size to the hidden size. In the case where the embedding size is the same as the hidden size, no projection - layer is used. -- ELECTRA is a transformer model pretrained with the use of another (small) masked language model. The inputs are corrupted by that language model, which takes an input text that is randomly masked and outputs a text in which ELECTRA has to predict which token is an original and which one has been replaced. Like for GAN training, the small language model is trained for a few steps (but with the original texts as objective, not to fool the ELECTRA model like in a traditional GAN setting) then the ELECTRA model is trained for a few steps. -- The ELECTRA checkpoints saved using [Google Research's implementation](https://github.com/google-research/electra) - contain both the generator and discriminator. The conversion script requires the user to name which model to export - into the correct architecture. Once converted to the HuggingFace format, these checkpoints may be loaded into all - available ELECTRA models, however. 
This means that the discriminator may be loaded in the - [`ElectraForMaskedLM`] model, and the generator may be loaded in the - [`ElectraForPreTraining`] model (the classification head will be randomly initialized as it - doesn't exist in the generator). - -## Resources - -- [Text classification task guide](../tasks/sequence_classification) -- [Token classification task guide](../tasks/token_classification) -- [Question answering task guide](../tasks/question_answering) -- [Causal language modeling task guide](../tasks/language_modeling) -- [Masked language modeling task guide](../tasks/masked_language_modeling) -- [Multiple choice task guide](../tasks/multiple_choice) +# ELECTRA + +[ELECTRA](https://huggingface.co/papers/2003.10555) is a clever alternative to traditional masked language models like BERT. Instead of just masking tokens and asking the model to predict them, ELECTRA trains two models working together: a generator and a discriminator. The generator replaces some tokens with plausible alternatives, and the discriminator (the model you'll actually use) learns to detect which tokens are original and which were replaced. + +This approach is super efficient because ELECTRA learns from every single token in the input, not just the masked ones. That's why even the small ELECTRA models can match or outperform much larger models while using way less computing resources. + +You can find all the original ELECTRA checkpoints under the [ELECTRA release](https://huggingface.co/collections/google/electra-release-64ff6e8b18830fabea30a1ab) collection. + +> [!TIP] +> Click on the right sidebar for more examples of how to use ELECTRA for different language tasks like sequence classification, token classification, and question answering. + +The example below demonstrates how to use ELECTRA for text classification tasks with [`Pipeline`] or the [`AutoModel`] class. 
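Before the classification examples, it can help to see the pretraining objective itself in code. The sketch below runs replaced token detection directly with [`ElectraForPreTraining`]; the sample sentence and the decision threshold of 0 on the logits are illustrative assumptions, not part of the original card.

```python
# Replaced token detection with the ELECTRA discriminator head.
# Each token receives a logit; a positive logit means the discriminator
# believes the token was swapped in by the generator.
# The sample sentence is an arbitrary, illustrative choice.
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

tokenizer = AutoTokenizer.from_pretrained("google/electra-small-discriminator")
model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

text = "The chef painted the meal for the guests"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch_size, sequence_length)

tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
predictions = (logits[0] > 0).int().tolist()  # assumed threshold: logit > 0 means "replaced"
for token, is_replaced in zip(tokens, predictions):
    print(f"{token}: {'replaced' if is_replaced else 'original'}")
```

This per-token signal is why ELECTRA learns from every input position rather than only the masked subset.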
+
+<hfoptions id="usage">
+<hfoption id="Pipeline">
+
+```py
+from transformers import pipeline
+
+classifier = pipeline("text-classification", model="google/electra-small-discriminator")
+result = classifier("This restaurant has amazing food!")
+print(result)
+```
+
+</hfoption>
+<hfoption id="AutoModel">
+
+```py
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+
+tokenizer = AutoTokenizer.from_pretrained("google/electra-small-discriminator")
+model = AutoModelForSequenceClassification.from_pretrained("google/electra-small-discriminator")
+inputs = tokenizer("ELECTRA is more efficient than BERT", return_tensors="pt")
+with torch.no_grad():
+    outputs = model(**inputs)
+    logits = outputs.logits
+print(logits)
+```
+
+</hfoption>
+</hfoptions>
+
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
+
+The example below uses [torchao](../quantization/torchao) to quantize the weights to int4:
+
+```py
+# pip install torchao
+import torch
+from transformers import TorchAoConfig, AutoModelForSequenceClassification, AutoTokenizer
+
+quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
+model = AutoModelForSequenceClassification.from_pretrained(
+    "google/electra-large-discriminator",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    quantization_config=quantization_config
+)
+tokenizer = AutoTokenizer.from_pretrained("google/electra-large-discriminator")
+inputs = tokenizer("ELECTRA uses less compute than other models", return_tensors="pt").to("cuda")
+with torch.no_grad():
+    outputs = model(**inputs)
+    logits = outputs.logits
+print(logits)
+```
+
+## Notes
+
+- ELECTRA consists of two transformer models: a generator (G) and a discriminator (D). For most downstream tasks, use the discriminator model (`*-discriminator`) rather than the generator.
+- ELECTRA comes in three sizes: Small (14M parameters), Base (110M parameters), and Large (335M parameters). 
+- ELECTRA can use a smaller embedding size than hidden size for efficiency. When `embedding_size` is set smaller than `hidden_size` in the configuration, a projection layer connects them. +- When using batched inputs with padding, make sure to use attention masks to prevent the model from attending to padding tokens: + + ```py + # Example of properly handling padding with attention masks + inputs = tokenizer(["Short text", "This is a much longer text that needs padding"], + padding=True, + return_tensors="pt") + outputs = model(**inputs) # automatically uses the attention_mask + ``` +- When using the discriminator for your downstream task, you can load it into any of the ELECTRA model classes (e.g., `ElectraForSequenceClassification`, `ElectraForTokenClassification`). ## ElectraConfig From fec39c02234cfc2da9adc1bdf24345709dc548c7 Mon Sep 17 00:00:00 2001 From: Wun0 Date: Fri, 28 Mar 2025 00:14:54 -0400 Subject: [PATCH 02/12] Update ELECTRA model card with new format --- docs/source/en/model_doc/electra.md | 144 +++++++++++++++++----------- 1 file changed, 87 insertions(+), 57 deletions(-) diff --git a/docs/source/en/model_doc/electra.md b/docs/source/en/model_doc/electra.md index bee883d64153..57dd38fa5857 100644 --- a/docs/source/en/model_doc/electra.md +++ b/docs/source/en/model_doc/electra.md @@ -14,66 +14,96 @@ rendered properly in your Markdown viewer. --> -# ELECTRA - -
-PyTorch -TensorFlow -Flax +
+ PyTorch + TensorFlow + Flax +
-## Overview - -The ELECTRA model was proposed in the paper [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than -Generators](https://openreview.net/pdf?id=r1xMH1BtvB). ELECTRA is a new pretraining approach which trains two -transformer models: the generator and the discriminator. The generator's role is to replace tokens in a sequence, and -is therefore trained as a masked language model. The discriminator, which is the model we're interested in, tries to -identify which tokens were replaced by the generator in the sequence. - -The abstract from the paper is the following: - -*Masked language modeling (MLM) pretraining methods such as BERT corrupt the input by replacing some tokens with [MASK] -and then train a model to reconstruct the original tokens. While they produce good results when transferred to -downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a -more sample-efficient pretraining task called replaced token detection. Instead of masking the input, our approach -corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead -of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that -predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments -demonstrate this new pretraining task is more efficient than MLM because the task is defined over all input tokens -rather than just the small subset that was masked out. As a result, the contextual representations learned by our -approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are -particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained -using 30x more compute) on the GLUE natural language understanding benchmark. 
Our approach also works well at scale, -where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when -using the same amount of compute.* - -This model was contributed by [lysandre](https://huggingface.co/lysandre). The original code can be found [here](https://github.com/google-research/electra). - -## Usage tips - -- ELECTRA is the pretraining approach, therefore there is nearly no changes done to the underlying model: BERT. The - only change is the separation of the embedding size and the hidden size: the embedding size is generally smaller, - while the hidden size is larger. An additional projection layer (linear) is used to project the embeddings from their - embedding size to the hidden size. In the case where the embedding size is the same as the hidden size, no projection - layer is used. -- ELECTRA is a transformer model pretrained with the use of another (small) masked language model. The inputs are corrupted by that language model, which takes an input text that is randomly masked and outputs a text in which ELECTRA has to predict which token is an original and which one has been replaced. Like for GAN training, the small language model is trained for a few steps (but with the original texts as objective, not to fool the ELECTRA model like in a traditional GAN setting) then the ELECTRA model is trained for a few steps. -- The ELECTRA checkpoints saved using [Google Research's implementation](https://github.com/google-research/electra) - contain both the generator and discriminator. The conversion script requires the user to name which model to export - into the correct architecture. Once converted to the HuggingFace format, these checkpoints may be loaded into all - available ELECTRA models, however. 
This means that the discriminator may be loaded in the - [`ElectraForMaskedLM`] model, and the generator may be loaded in the - [`ElectraForPreTraining`] model (the classification head will be randomly initialized as it - doesn't exist in the generator). - -## Resources - -- [Text classification task guide](../tasks/sequence_classification) -- [Token classification task guide](../tasks/token_classification) -- [Question answering task guide](../tasks/question_answering) -- [Causal language modeling task guide](../tasks/language_modeling) -- [Masked language modeling task guide](../tasks/masked_language_modeling) -- [Multiple choice task guide](../tasks/multiple_choice) +# ELECTRA + +[ELECTRA](https://huggingface.co/papers/2003.10555) is a clever alternative to traditional masked language models like BERT. Instead of just masking tokens and asking the model to predict them, ELECTRA trains two models working together: a generator and a discriminator. The generator replaces some tokens with plausible alternatives, and the discriminator (the model you'll actually use) learns to detect which tokens are original and which were replaced. + +This approach is super efficient because ELECTRA learns from every single token in the input, not just the masked ones. That's why even the small ELECTRA models can match or outperform much larger models while using way less computing resources. + +You can find all the original ELECTRA checkpoints under the [ELECTRA release](https://huggingface.co/collections/google/electra-release-64ff6e8b18830fabea30a1ab) collection. + +> [!TIP] +> Click on the right sidebar for more examples of how to use ELECTRA for different language tasks like sequence classification, token classification, and question answering. + +The example below demonstrates how to use ELECTRA for text classification tasks with [`Pipeline`] or the [`AutoModel`] class. 
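Before the classification examples, it can help to see the pretraining objective itself in code. The sketch below runs replaced token detection directly with [`ElectraForPreTraining`]; the sample sentence and the decision threshold of 0 on the logits are illustrative assumptions, not part of the original card.

```python
# Replaced token detection with the ELECTRA discriminator head.
# Each token receives a logit; a positive logit means the discriminator
# believes the token was swapped in by the generator.
# The sample sentence is an arbitrary, illustrative choice.
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

tokenizer = AutoTokenizer.from_pretrained("google/electra-small-discriminator")
model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

text = "The chef painted the meal for the guests"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch_size, sequence_length)

tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
predictions = (logits[0] > 0).int().tolist()  # assumed threshold: logit > 0 means "replaced"
for token, is_replaced in zip(tokens, predictions):
    print(f"{token}: {'replaced' if is_replaced else 'original'}")
```

This per-token signal is why ELECTRA learns from every input position rather than only the masked subset.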
+
+<hfoptions id="usage">
+<hfoption id="Pipeline">
+
+```py
+from transformers import pipeline
+
+classifier = pipeline("text-classification", model="google/electra-small-discriminator")
+result = classifier("This restaurant has amazing food!")
+print(result)
+```
+
+</hfoption>
+<hfoption id="AutoModel">
+
+```py
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+
+tokenizer = AutoTokenizer.from_pretrained("google/electra-small-discriminator")
+model = AutoModelForSequenceClassification.from_pretrained("google/electra-small-discriminator")
+inputs = tokenizer("ELECTRA is more efficient than BERT", return_tensors="pt")
+with torch.no_grad():
+    outputs = model(**inputs)
+    logits = outputs.logits
+print(logits)
+```
+
+</hfoption>
+</hfoptions>
+
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
+
+The example below uses [torchao](../quantization/torchao) to quantize the weights to int4:
+
+```py
+# pip install torchao
+import torch
+from transformers import TorchAoConfig, AutoModelForSequenceClassification, AutoTokenizer
+
+quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
+model = AutoModelForSequenceClassification.from_pretrained(
+    "google/electra-large-discriminator",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    quantization_config=quantization_config
+)
+tokenizer = AutoTokenizer.from_pretrained("google/electra-large-discriminator")
+inputs = tokenizer("ELECTRA uses less compute than other models", return_tensors="pt").to("cuda")
+with torch.no_grad():
+    outputs = model(**inputs)
+    logits = outputs.logits
+print(logits)
+```
+
+## Notes
+
+- ELECTRA consists of two transformer models: a generator (G) and a discriminator (D). For most downstream tasks, use the discriminator model (`*-discriminator`) rather than the generator.
+- ELECTRA comes in three sizes: Small (14M parameters), Base (110M parameters), and Large (335M parameters). 
+- ELECTRA can use a smaller embedding size than hidden size for efficiency. When `embedding_size` is set smaller than `hidden_size` in the configuration, a projection layer connects them. +- When using batched inputs with padding, make sure to use attention masks to prevent the model from attending to padding tokens: + + ```py + # Example of properly handling padding with attention masks + inputs = tokenizer(["Short text", "This is a much longer text that needs padding"], + padding=True, + return_tensors="pt") + outputs = model(**inputs) # automatically uses the attention_mask + ``` +- When using the discriminator for your downstream task, you can load it into any of the ELECTRA model classes (e.g., `ElectraForSequenceClassification`, `ElectraForTokenClassification`). ## ElectraConfig From 4903de03d91e612089cab131d61cf8ed68705cbf Mon Sep 17 00:00:00 2001 From: Surya Garikipati <86141988+Wu-n0@users.noreply.github.com> Date: Sun, 30 Mar 2025 04:20:58 -0400 Subject: [PATCH 03/12] Update docs/source/en/model_doc/electra.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/electra.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/electra.md b/docs/source/en/model_doc/electra.md index 57dd38fa5857..dbbdca60eca6 100644 --- a/docs/source/en/model_doc/electra.md +++ b/docs/source/en/model_doc/electra.md @@ -25,7 +25,7 @@ rendered properly in your Markdown viewer. # ELECTRA -[ELECTRA](https://huggingface.co/papers/2003.10555) is a clever alternative to traditional masked language models like BERT. Instead of just masking tokens and asking the model to predict them, ELECTRA trains two models working together: a generator and a discriminator. The generator replaces some tokens with plausible alternatives, and the discriminator (the model you'll actually use) learns to detect which tokens are original and which were replaced. 
+[ELECTRA](https://huggingface.co/papers/2003.10555) modifies the pretraining objective of traditional masked language models like BERT. Instead of just masking tokens and asking the model to predict them, ELECTRA trains two models working, a generator and a discriminator. The generator replaces some tokens with plausible alternatives and the discriminator (the model you'll actually use) learns to detect which tokens are original and which were replaced. This training approach is very efficient and scales to larger models while using considerably less compute. This approach is super efficient because ELECTRA learns from every single token in the input, not just the masked ones. That's why even the small ELECTRA models can match or outperform much larger models while using way less computing resources. From b1bc9b9c169f2fccd7db069484c7aa1155750284 Mon Sep 17 00:00:00 2001 From: Surya Garikipati <86141988+Wu-n0@users.noreply.github.com> Date: Sun, 30 Mar 2025 04:21:08 -0400 Subject: [PATCH 04/12] Update docs/source/en/model_doc/electra.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/electra.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/electra.md b/docs/source/en/model_doc/electra.md index dbbdca60eca6..64891b58314b 100644 --- a/docs/source/en/model_doc/electra.md +++ b/docs/source/en/model_doc/electra.md @@ -29,7 +29,7 @@ rendered properly in your Markdown viewer. This approach is super efficient because ELECTRA learns from every single token in the input, not just the masked ones. That's why even the small ELECTRA models can match or outperform much larger models while using way less computing resources. -You can find all the original ELECTRA checkpoints under the [ELECTRA release](https://huggingface.co/collections/google/electra-release-64ff6e8b18830fabea30a1ab) collection. 
+You can find all the original ELECTRA checkpoints under the [ELECTRA](https://huggingface.co/collections/google/electra-release-64ff6e8b18830fabea30a1ab) release. > [!TIP] > Click on the right sidebar for more examples of how to use ELECTRA for different language tasks like sequence classification, token classification, and question answering. From 8524ba7bd9c987f83c746bac4b1df3b33af20f01 Mon Sep 17 00:00:00 2001 From: Surya Garikipati <86141988+Wu-n0@users.noreply.github.com> Date: Sun, 30 Mar 2025 04:21:18 -0400 Subject: [PATCH 05/12] Update docs/source/en/model_doc/electra.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/electra.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/electra.md b/docs/source/en/model_doc/electra.md index 64891b58314b..4b6d16d8e363 100644 --- a/docs/source/en/model_doc/electra.md +++ b/docs/source/en/model_doc/electra.md @@ -34,7 +34,7 @@ You can find all the original ELECTRA checkpoints under the [ELECTRA](https://hu > [!TIP] > Click on the right sidebar for more examples of how to use ELECTRA for different language tasks like sequence classification, token classification, and question answering. -The example below demonstrates how to use ELECTRA for text classification tasks with [`Pipeline`] or the [`AutoModel`] class. +The example below demonstrates how to classify text with [`Pipeline`] or the [`AutoModel`] class. 
From 1db39262e6ac04194488580cdfd447724750f217 Mon Sep 17 00:00:00 2001 From: Surya Garikipati <86141988+Wu-n0@users.noreply.github.com> Date: Sun, 30 Mar 2025 04:21:52 -0400 Subject: [PATCH 06/12] Update docs/source/en/model_doc/electra.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/electra.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/electra.md b/docs/source/en/model_doc/electra.md index 4b6d16d8e363..d6f0e4ffb804 100644 --- a/docs/source/en/model_doc/electra.md +++ b/docs/source/en/model_doc/electra.md @@ -91,7 +91,7 @@ print(logits) ## Notes -- ELECTRA consists of two transformer models: a generator (G) and a discriminator (D). For most downstream tasks, use the discriminator model (`*-discriminator`) rather than the generator. +- ELECTRA consists of two transformer models, a generator (G) and a discriminator (D). For most downstream tasks, use the discriminator model (as indicated by `*-discriminator` in the name) rather than the generator. - ELECTRA comes in three sizes: Small (14M parameters), Base (110M parameters), and Large (335M parameters). - ELECTRA can use a smaller embedding size than hidden size for efficiency. When `embedding_size` is set smaller than `hidden_size` in the configuration, a projection layer connects them. 
- When using batched inputs with padding, make sure to use attention masks to prevent the model from attending to padding tokens: From 7084017bdf3b1ca4e7c7c435717ebabcbbe3991b Mon Sep 17 00:00:00 2001 From: Surya Garikipati <86141988+Wu-n0@users.noreply.github.com> Date: Sun, 30 Mar 2025 04:22:19 -0400 Subject: [PATCH 07/12] Update docs/source/en/model_doc/electra.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/electra.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/electra.md b/docs/source/en/model_doc/electra.md index d6f0e4ffb804..80431081b8f4 100644 --- a/docs/source/en/model_doc/electra.md +++ b/docs/source/en/model_doc/electra.md @@ -93,7 +93,7 @@ print(logits) - ELECTRA consists of two transformer models, a generator (G) and a discriminator (D). For most downstream tasks, use the discriminator model (as indicated by `*-discriminator` in the name) rather than the generator. - ELECTRA comes in three sizes: Small (14M parameters), Base (110M parameters), and Large (335M parameters). -- ELECTRA can use a smaller embedding size than hidden size for efficiency. When `embedding_size` is set smaller than `hidden_size` in the configuration, a projection layer connects them. +- ELECTRA can use a smaller embedding size than the hidden size for efficiency. When `embedding_size` is smaller than `hidden_size` in the configuration, a projection layer connects them. 
- When using batched inputs with padding, make sure to use attention masks to prevent the model from attending to padding tokens: ```py From 712fe88741b920ecb6f9f587278740931feebad2 Mon Sep 17 00:00:00 2001 From: Surya Garikipati <86141988+Wu-n0@users.noreply.github.com> Date: Sun, 30 Mar 2025 04:22:36 -0400 Subject: [PATCH 08/12] Update docs/source/en/model_doc/electra.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/electra.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/electra.md b/docs/source/en/model_doc/electra.md index 80431081b8f4..f49278e2fad0 100644 --- a/docs/source/en/model_doc/electra.md +++ b/docs/source/en/model_doc/electra.md @@ -94,7 +94,7 @@ print(logits) - ELECTRA consists of two transformer models, a generator (G) and a discriminator (D). For most downstream tasks, use the discriminator model (as indicated by `*-discriminator` in the name) rather than the generator. - ELECTRA comes in three sizes: Small (14M parameters), Base (110M parameters), and Large (335M parameters). - ELECTRA can use a smaller embedding size than the hidden size for efficiency. When `embedding_size` is smaller than `hidden_size` in the configuration, a projection layer connects them. -- When using batched inputs with padding, make sure to use attention masks to prevent the model from attending to padding tokens: +- When using batched inputs with padding, make sure to use attention masks to prevent the model from attending to padding tokens. 
```py # Example of properly handling padding with attention masks From 034aeafb0a332bc4d455d2e6d31bd3e734c1b3cd Mon Sep 17 00:00:00 2001 From: Surya Garikipati <86141988+Wu-n0@users.noreply.github.com> Date: Sun, 30 Mar 2025 04:22:49 -0400 Subject: [PATCH 09/12] Update docs/source/en/model_doc/electra.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/electra.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/electra.md b/docs/source/en/model_doc/electra.md index f49278e2fad0..d474b889a1f1 100644 --- a/docs/source/en/model_doc/electra.md +++ b/docs/source/en/model_doc/electra.md @@ -103,7 +103,7 @@ print(logits) return_tensors="pt") outputs = model(**inputs) # automatically uses the attention_mask ``` -- When using the discriminator for your downstream task, you can load it into any of the ELECTRA model classes (e.g., `ElectraForSequenceClassification`, `ElectraForTokenClassification`). +- When using the discriminator for a downstream task, you can load it into any of the ELECTRA model classes ([`ElectraForSequenceClassification`], [`ElectraForTokenClassification`], etc.). ## ElectraConfig From 142bf6d4a3e1edf0c32295a2e8e86b8a065459d4 Mon Sep 17 00:00:00 2001 From: Surya Garikipati <86141988+Wu-n0@users.noreply.github.com> Date: Sun, 30 Mar 2025 04:23:09 -0400 Subject: [PATCH 10/12] Update docs/source/en/model_doc/electra.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/electra.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/electra.md b/docs/source/en/model_doc/electra.md index d474b889a1f1..940456776fa4 100644 --- a/docs/source/en/model_doc/electra.md +++ b/docs/source/en/model_doc/electra.md @@ -92,7 +92,7 @@ print(logits) ## Notes - ELECTRA consists of two transformer models, a generator (G) and a discriminator (D). 
For most downstream tasks, use the discriminator model (as indicated by `*-discriminator` in the name) rather than the generator. -- ELECTRA comes in three sizes: Small (14M parameters), Base (110M parameters), and Large (335M parameters). +- ELECTRA comes in three sizes: small (14M parameters), base (110M parameters), and large (335M parameters). - ELECTRA can use a smaller embedding size than the hidden size for efficiency. When `embedding_size` is smaller than `hidden_size` in the configuration, a projection layer connects them. - When using batched inputs with padding, make sure to use attention masks to prevent the model from attending to padding tokens. From cc80208c25aa7b9c7ca77c183a1871b9f4dfa3e1 Mon Sep 17 00:00:00 2001 From: Surya Garikipati <86141988+Wu-n0@users.noreply.github.com> Date: Thu, 3 Apr 2025 00:02:23 -0400 Subject: [PATCH 11/12] Update docs/source/en/model_doc/electra.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/electra.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/electra.md b/docs/source/en/model_doc/electra.md index 940456776fa4..0b9f68d4642a 100644 --- a/docs/source/en/model_doc/electra.md +++ b/docs/source/en/model_doc/electra.md @@ -25,7 +25,7 @@ rendered properly in your Markdown viewer. # ELECTRA -[ELECTRA](https://huggingface.co/papers/2003.10555) modifies the pretraining objective of traditional masked language models like BERT. Instead of just masking tokens and asking the model to predict them, ELECTRA trains two models working, a generator and a discriminator. The generator replaces some tokens with plausible alternatives and the discriminator (the model you'll actually use) learns to detect which tokens are original and which were replaced. This training approach is very efficient and scales to larger models while using considerably less compute. 
+[ELECTRA](https://huggingface.co/papers/2003.10555) modifies the pretraining objective of traditional masked language models like BERT. Instead of just masking tokens and asking the model to predict them, ELECTRA trains two models, a generator and a discriminator. The generator replaces some tokens with plausible alternatives and the discriminator (the model you'll actually use) learns to detect which tokens are original and which were replaced. This training approach is very efficient and scales to larger models while using considerably less compute. This approach is super efficient because ELECTRA learns from every single token in the input, not just the masked ones. That's why even the small ELECTRA models can match or outperform much larger models while using way less computing resources. From 6cf7861accc5bba737801df46bec360b68a59353 Mon Sep 17 00:00:00 2001 From: Steven Liu <59462357+stevhliu@users.noreply.github.com> Date: Thu, 3 Apr 2025 10:06:50 -0700 Subject: [PATCH 12/12] close hfoption block --- docs/source/en/model_doc/electra.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs/source/en/model_doc/electra.md b/docs/source/en/model_doc/electra.md index c189dc3345ac..9506d6dba1c0 100644 --- a/docs/source/en/model_doc/electra.md +++ b/docs/source/en/model_doc/electra.md @@ -84,6 +84,9 @@ print(f"Predicted label: {predicted_label}") echo -e "This restaurant has amazing food." | transformers-cli run --task text-classification --model bhadresh-savani/electra-base-emotion --device 0 ``` + + + ## Notes - ELECTRA consists of two transformer models, a generator (G) and a discriminator (D). For most downstream tasks, use the discriminator model (as indicated by `*-discriminator` in the name) rather than the generator.