Support new Marian models #15831
Conversation
```python
# if word embeddings are not tied, make sure that lm head is resized as well
if (
    self.config.share_encoder_decoder_embeddings
    and self.get_output_embeddings() is not None
    and not self.config.tie_word_embeddings
):
    old_lm_head = self.get_output_embeddings()
    new_lm_head = self._get_resized_lm_head(old_lm_head, new_num_tokens)
    self.set_output_embeddings(new_lm_head)
```
This will only resize the lm_head if embeddings are shared.
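When the embeddings are *not* shared, the `lm_head` resize would instead have to happen in the separate `resize_decoder_token_embeddings` path this PR introduces. A minimal sketch of how that path could handle it, reusing the same helpers as the snippet above (the exact body is an assumption, not the PR's code):

```python
def resize_decoder_token_embeddings(self, new_num_tokens):
    # sketch: only meaningful when encoder/decoder embeddings are NOT shared
    if self.config.share_encoder_decoder_embeddings:
        raise ValueError(
            "`resize_decoder_token_embeddings` cannot be used when "
            "`config.share_encoder_decoder_embeddings` is True; "
            "use `resize_token_embeddings` instead."
        )

    # resize the decoder input embeddings
    old_embeddings = self.get_decoder_input_embeddings()
    new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)
    self.set_decoder_input_embeddings(new_embeddings)

    # mirror the snippet above: resize the lm_head when it is not tied
    if self.get_output_embeddings() is not None and not self.config.tie_word_embeddings:
        old_lm_head = self.get_output_embeddings()
        new_lm_head = self._get_resized_lm_head(old_lm_head, new_num_tokens)
        self.set_output_embeddings(new_lm_head)

    # keep the config in sync with the new decoder vocab size
    self.config.decoder_vocab_size = new_num_tokens
```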
```python
# if embeddings are shared this will return shared embeddings otherwise decoder embed_tokens
word_embeddings = self.get_decoder().get_input_embeddings()
self._tie_or_clone_weights(output_embeddings, word_embeddings)
```
We always return decoder embeddings here. This should work for both cases, shared or not shared.
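For context, a sketch of how the `tie_weights` override could look around the quoted lines (the guard and the method structure are assumptions; only the three quoted lines come from the diff):

```python
def tie_weights(self):
    # only tie when the config asks for tied word embeddings
    output_embeddings = self.get_output_embeddings()
    if output_embeddings is not None and self.config.tie_word_embeddings:
        # if embeddings are shared this will return the shared embeddings,
        # otherwise the decoder embed_tokens
        word_embeddings = self.get_decoder().get_input_embeddings()
        self._tie_or_clone_weights(output_embeddings, word_embeddings)
```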
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
```python
def get_decoder_input_embeddings(self):
    if self.config.share_encoder_decoder_embeddings:
        raise ValueError(
```
Why raise an error here? It's totally fine to just return `self.get_input_embeddings()` in this case, no?
Still don't think we need to raise here ;-)
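A sketch of the alternative the reviewer is suggesting, i.e. falling back to the shared embeddings instead of raising (hypothetical body, not the PR's code):

```python
def get_decoder_input_embeddings(self):
    # when encoder and decoder share embeddings, just return the shared table
    if self.config.share_encoder_decoder_embeddings:
        return self.get_input_embeddings()
    return self.get_decoder().get_input_embeddings()
```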
patrickvonplaten left a comment
Overall, I'm in favor of adding the new Marian checkpoints the way it is shown here. The change from a Marian model that always force-ties the encoder and decoder embeddings to one that can switch between force-tied and untied encoder input embeddings and encoder output embeddings is the better option here IMO, even though it goes a bit against our philosophy of not changing existing model code.
The main reasons why I'm in favor of the approach as it's implemented now are (with the feedback given below):

- All the changes of this PR are also applicable to existing Marian V1 checkpoints. More specifically, all Marian V1 checkpoints can be loaded here with `share_encoder_decoder_embeddings=False` and then fine-tuned with embeddings not being tied (see the sketch below).
- Marian V2 comes from the exact same library as Marian V1 and is the same model. Creating a new name here (Marian V2) could confuse users.
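A sketch of that workflow on an existing V1 checkpoint (`Helsinki-NLP/opus-mt-en-de` is just an example checkpoint; passing the flag through `from_pretrained` relies on unused kwargs being forwarded to the config):

```python
from transformers import MarianMTModel

# load an existing Marian V1 checkpoint with untied encoder/decoder embeddings,
# then fine-tune it as usual
model = MarianMTModel.from_pretrained(
    "Helsinki-NLP/opus-mt-en-de",
    share_encoder_decoder_embeddings=False,
)
assert model.config.share_encoder_decoder_embeddings is False
```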
Thoughts @LysandreJik @sgugger ?
sgugger left a comment
Ok for me. It's really pushing the test for a new model to its limit, but I understand the arguments to keep it in the same model.
patrickvonplaten left a comment
Looks good to me in general.
Left a couple of comments.
Also, given that a bunch of new model checkpoints will be added here, let's maybe add a slow integration test as well?
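A sketch of what such a slow integration test could look like (the checkpoint name and the assertions are placeholders, not real values):

```python
import unittest

from transformers import MarianMTModel, MarianTokenizer
from transformers.testing_utils import require_torch, slow


@require_torch
class NewMarianIntegrationTest(unittest.TestCase):
    @slow
    def test_untied_embeddings_generation(self):
        # placeholder checkpoint name for one of the new Marian models
        checkpoint = "org/new-marian-checkpoint"
        tokenizer = MarianTokenizer.from_pretrained(checkpoint)
        model = MarianMTModel.from_pretrained(checkpoint)

        # the new checkpoints are expected not to share embeddings
        self.assertFalse(model.config.share_encoder_decoder_embeddings)

        inputs = tokenizer(["Hello, how are you?"], return_tensors="pt")
        generated = model.generate(**inputs)
        decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)

        # placeholder assertion; a real test would compare against a known translation
        self.assertEqual(len(decoded), 1)
```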
What does this PR do?
This PR updates the Marian model:
- allow not sharing the embeddings between the encoder, the decoder, and the `lm_head`
- separate vocabs in the tokenizer for the `src` and `tgt` language

To support this, the PR introduces the following new methods:
- `get_decoder_input_embeddings` and `set_decoder_input_embeddings`: to get and set the decoder embeddings when the embeddings are not shared. These methods will raise an error if the embeddings are shared.
- `resize_decoder_token_embeddings`: to resize only the decoder embeddings. Will raise an error if the embeddings are shared.
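A short usage sketch for these methods, assuming a checkpoint whose embeddings are not shared (the checkpoint path is a placeholder):

```python
from transformers import MarianMTModel

model = MarianMTModel.from_pretrained("path/to/unshared-embeddings-checkpoint")

if not model.config.share_encoder_decoder_embeddings:
    # inspect the decoder-side embedding table
    decoder_embeddings = model.get_decoder_input_embeddings()
    print(decoder_embeddings.weight.shape)

    # grow only the decoder vocabulary, e.g. after adding target-language tokens
    model.resize_decoder_token_embeddings(model.config.decoder_vocab_size + 8)
```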
This PR also adds two new config attributes to `MarianConfig`:
- `share_encoder_decoder_embeddings`: to indicate if the embeddings should be shared or not
- `decoder_vocab_size`: to specify the vocab size for the decoder when the embeddings are not shared.
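For illustration, a minimal sketch of setting the new attributes when building a config (the vocab sizes are made-up values):

```python
from transformers import MarianConfig

config = MarianConfig(
    vocab_size=58101,                        # encoder / source vocab size (made up)
    decoder_vocab_size=43563,                # separate decoder / target vocab size (made up)
    share_encoder_decoder_embeddings=False,  # keep encoder and decoder embeddings untied
)
```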
And the following methods from the `PreTrainedModel` class are overridden to support these changes:
- `tie_weights`
- `_resize_token_embeddings`

Fixes #15109