add new model prophetnet #6187
qiweizhen wants to merge 4 commits into huggingface:master from qiweizhen:master
Conversation
Thanks @qiweizhen! It's really exciting to see ProphetNet, since it's been a while since we've had a big language model integration like this!
You can add the link and a TL;DR of your paper in the README (and make sure to do so). Also, please write some unit tests to make sure the model works as expected (of course you can do that after we have done several rounds of reviews on the modeling and tokenizer code). Please consider adding a TensorFlow model once the PyTorch one is shipped. Also, please complete the model cards of your uploaded weights!
```python
def __new__(cls, **kwargs):
    xprophetnet_tokenizer = False if 'xprophetnet_tokenizer' not in kwargs.keys() else kwargs['xprophetnet_tokenizer']
    if xprophetnet_tokenizer:
        super_class = XLMRobertaTokenizer
    else:
        super_class = BertTokenizer
    cls = type(cls.__name__, (cls, super_class), {})
    if xprophetnet_tokenizer:
        cls.vocab_files_names = VOCAB_FILES_NAMES_CROSS_LINGUAL
    else:
        cls.vocab_files_names = VOCAB_FILES_NAMES_EN
    return super(ProphetNetTokenizer, cls).__new__(cls)
```
I personally think it's too hacky here. However, regarding the vocab_files_names issue, you can simply choose your own name, e.g., prophet.bpe or prophet.tokenizer. Be creative lol!
Then I think you can put vocabulary_type as a configuration option in the config! There should be no problem then.
Done as you suggested. Vocab files are now named prophetnet.tokenization.
```python
    **kwargs
):
    if not xprophetnet_tokenizer:
        # inherit from BERT tokenizer
```
I would use the wording copied from instead of inherit.
As I reviewed the code in tokenization_bart and other files, it seems inheriting is also acceptable. So, instead of copying every function and modifying it as self._tokenizer.func, I think this old code is simpler and avoids introducing bugs when copying the code.
Well, I think both ways are okay, but it's also good to hear from @LysandreJik.
```python
        )
        self.unique_no_split_tokens.append("[X_SEP]")
    else:
        # inherit from XLM-R tokenizer
```
```python
inputs = tokenizer([EN_SENTENCE_TO_QUESTION, RU_SENTENCE_TO_QUESTION, ZH_SENTENCE_TO_QUESTION], padding=True, max_length=256, return_tensors='pt')

# Generate Summary (bos is removed)
summary_ids = model.generate(inputs['input_ids'][:, 1:], num_beams=4, max_length=100, early_stopping=True)
```
My solution is to simply merge this into the model, like input = input[:, 1:]. Users shouldn't need to care about details like this.
Done as you suggested. Removing the bos token is now done inside the model.
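The change above can be illustrated with a small generic sketch (this is not the actual PR diff — the helper name `prepare_inputs` and the bos id are illustrative assumptions): the model drops the leading bos column once, internally, instead of every caller slicing it off.

```python
# Generic illustration of moving bos removal into the model: slice off the
# first (bos) column of input_ids inside the model instead of at every call
# site. prepare_inputs and bos_token_id=0 are hypothetical, not PR code.
import torch

input_ids = torch.tensor([[0, 11, 12, 13],   # 0 = bos
                          [0, 21, 22, 23]])

def prepare_inputs(input_ids, bos_token_id=0):
    # drop the leading bos column if present, so callers can pass raw input_ids
    if (input_ids[:, 0] == bos_token_id).all():
        input_ids = input_ids[:, 1:]
    return input_ids

prepared = prepare_inputs(input_ids)
```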
```python
from .modeling_prophetnet import (
    ProphetNetModel,
    ProphetNetForConditionalGeneration
)
```
I totally understand that ProphetNet is a specialized model for generation, but you may also want to add something like ProphetNetForSequenceClassification etc.? BART has them.
Thank you for this suggestion. I will add ProphetNetForSequenceClassification in the near future, once I find a suitable approach for NLU tasks. For example, I will compare the different encoder-decoder NLU heads of T5, BART...
Agree that it would be nice to also have ProphetNetForSequenceClassification, but I think we can handle this in a new PR. The most important one for now is probably ProphetNetForConditionalGeneration.
```python
(ElectraConfig, ElectraForMaskedLM),
(EncoderDecoderConfig, EncoderDecoderModel),
(ReformerConfig, ReformerModelWithLMHead),
(ProphetNetConfig, ProphetNetModel),
```
Is this correct? No need for WithLMHeadModel?
JetRunner left a comment
@sshleifer Is there anything else that needs to be done to make ProphetNet work with your seq2seq example?
Also @mfuntowicz for pipelines
I tried examples/seq2seq/finetune.py and it works with python finetune.py --do_train and --do_predict.

I will try to complete the documentation and unit tests this week.
```python
    eps=0.0,
    **common_kwargs
):
    if "hidden_size" in common_kwargs:
```
Do we have to call the parameter d_model? It would be great if we could just use hidden_size to avoid confusion. While a lot of previous models have d_model, we are now trying to be more consistent with naming.
```python
super().__init__(num_embeddings, embedding_dim, padding_idx)
self.onnx_trace = False

def forward(self, input, use_cache=False, positions=None):
```
We usually call input input_ids in all other models. It would be great if you could align the naming.
```python
    real_positions = positions
else:
    real_positions = positions
return super().forward(positions), real_positions
```
It's a bit confusing to me that two tensors are returned -> could you add a comment explaining why this is the case?
I think I would also prefer to split the class up into a function and a class:
- a function that calculates the "real_positions"
- a very simple nn.Embedding
I'm not really sure we need such a big class. IMO, one can calculate the real_positions via a function; then we would only need nn.Embedding, which is much more readable, and we wouldn't need _forward at all. max_positions is such a small function that we can just copy-paste it into the code.
But maybe I overlooked something - what do you think?
It's not that big of a deal though... we can also handle this in a refactor later.
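A minimal sketch of this suggestion (the function name `compute_real_positions` and the fairseq-style padding-offset convention are assumptions, not the PR's actual code): a standalone function computes the positions, and a plain nn.Embedding does the lookup.

```python
# Hedged sketch: compute "real" positions with a free function, then use a
# plain nn.Embedding instead of a large custom embedding class.
import torch
import torch.nn as nn

def compute_real_positions(input_ids, padding_idx):
    # Positions count non-padding tokens, offset by padding_idx + 1 so that
    # position ids never collide with the padding index (fairseq convention).
    mask = input_ids.ne(padding_idx).long()
    return torch.cumsum(mask, dim=1) * mask + padding_idx

embed_positions = nn.Embedding(512 + 2, 64, padding_idx=1)
input_ids = torch.tensor([[5, 6, 7, 1, 1]])  # 1 is the padding idx here
real_positions = compute_real_positions(input_ids, padding_idx=1)
position_embeds = embed_positions(real_positions)
```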
```python
def _forward(self, positions):
    return super().forward(positions)


def LayerNorm(normalized_shape, eps=1e-5, elementwise_affine=True, export=False):
```
This is very different from our usual design... an upper-case name is very unexpected for a function (even though the function instantiates a class...). I think a better way here is to do the following at the very top of the file:

```python
try:
    from apex.normalization import FusedLayerNorm
    ProphetNetLayerNorm = FusedLayerNorm
except ImportError:
    ProphetNetLayerNorm = torch.nn.LayerNorm
```

Then the layer norm can be instantiated normally with ProphetNetLayerNorm(normalized_shape, eps, elementwise_affine).
At the moment, export is always set to False when calling LayerNorm, and the other two parameters are also not used... should we put eps and elementwise_affine in the config? Do we need export? Or could we delete it?
Maybe importing LayerNorm from another model is a good idea?
like Bart, which does exactly this :)
```python
    return torch.nn.LayerNorm(normalized_shape, eps, elementwise_affine)


def invert_mask(attention_mask):
    assert attention_mask.dim() == 2
```
An assert message would be nice :-)

Yes, it can be as easy as assert attention_mask.dim() == 2, "some error message"
```python
    embed_dim,
    num_heads,
    dropout=0.0,
    bias=True,
```
Maybe change it to is_bias so everyone knows it's a flag.
Can we reuse Bart? This looks identical.
```python
self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
self.cache_key = "encoder_decoder" if self.encoder_decoder_attention else "self"
```
This looks like complicated logic in the following lines :D Maybe add a comment?
```python
self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
self.cache_key = "encoder_decoder" if self.encoder_decoder_attention else "self"

def _shape(self, tensor, dim_0, bsz):
```
Why is the function named _shape? -> it looks more like _reshape_for_...

And can we make it static for better readability?

Static functions are always less scary IMO :D
| """Input shape: Time(SeqLen) x Batch x Channel""" | ||
| static_kv: bool = self.encoder_decoder_attention | ||
| tgt_len, bsz, embed_dim = query.size() | ||
| assert embed_dim == self.embed_dim |
An assert message would be great.
```python
static_kv: bool = self.encoder_decoder_attention
tgt_len, bsz, embed_dim = query.size()
assert embed_dim == self.embed_dim
assert list(query.size()) == [tgt_len, bsz, embed_dim]
```
An assert message would be great.
```python
def forward(
    self,
    query,
```
I think query is actually the hidden_states here, if I'm not mistaken -> we usually call the input embedding representations hidden_states... by query I would think of the query projection, which could be a bit misleading.
```python
v = self._shape(v, -1, bsz)

if saved_state is not None:
    k, v, key_padding_mask = self._use_saved_state(k, v, saved_state, key_padding_mask, static_kv, bsz)
```
(nit) Can we try to avoid single-letter variables? They always make refactoring very difficult afterwards -> maybe just stick to key, value and query, and change the first key to hidden_states_key.

Well, I think k, v, q are acceptable, but there's no harm in using longer names.
| "prev_key_padding_mask": key_padding_mask if not static_kv else None, | ||
| } | ||
|
|
||
| assert k is not None |
An assert message would be great.
```python
assert k is not None
src_len = k.size(1)
attn_weights = torch.bmm(q, k.transpose(1, 2))
assert attn_weights.size() == (bsz * self.num_heads, tgt_len, src_len)
```
An assert message would be great.
```python
# This is part of a workaround to get around fork/join parallelism not supporting Optional types.
if key_padding_mask is not None and key_padding_mask.dim() == 0:
    key_padding_mask = None
assert key_padding_mask is None or key_padding_mask.size()[:2] == (bsz, src_len,)
```
An assert message would be great.
```python
if key_padding_mask is not None:  # don't attend to padding symbols
    attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
    reshaped = key_padding_mask.unsqueeze(1).unsqueeze(2)
```
(nit) key_padding_mask[:, None, None] is nicer IMO
```python
reshaped = key_padding_mask.unsqueeze(1).unsqueeze(2)
attn_weights = attn_weights.masked_fill(reshaped, float("-inf"))
attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
attn_weights = F.softmax(attn_weights, dim=-1)
```
-> attn_probs = F.softmax(attn_weights, dim=-1); after the softmax we have probabilities.
```python
attn_weights = F.softmax(attn_weights, dim=-1)
attn_probs = F.dropout(attn_weights, p=self.dropout, training=self.training,)

assert v is not None
```
An assert message is missing.
```python
if self.bias_v is not None:
    nn.init.xavier_normal_(self.bias_v)


def _relative_positions_bucket(self, relative_positions, bidirectional=False):
```
is_bidirectional might be a better name.
```python
# input attn_weights [T*head, T, S]
# input real_positions [B, T] or [1, 1]

T, B, _ = query.size()
```
Not a fan of single letters here... but I guess it's OK for now.
```python
else:
    saved_state = None
    layer_state = {}
```
If I read the function correctly, the arguments key and value are not used -> can we delete them from the function signature?
```python
    output_attentions=False
):

    tgt_len, bsz, embed_dim = query.size()
```
Can we rename query to hidden_states here as well?
```python
if self.bias_k is not None:
    assert self.bias_v is not None
    k = torch.cat([k, self.bias_k.repeat(1, bsz, 1)])
```
As a side note, it's always good to check whether expand works before using repeat -> repeat always allocates new memory, which can become heavy depending on how big the tensors are.
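This side note can be illustrated with a small generic example (not PR code): expand returns a stride-0 view over the original storage, while repeat materializes a full copy.

```python
# Hedged illustration of expand vs repeat: both produce the same values, but
# expand creates a view (stride 0 along the expanded dim, no new memory),
# while repeat allocates a new tensor.
import torch

bias_k = torch.randn(1, 1, 8)
bsz = 4

repeated = bias_k.repeat(1, bsz, 1)      # allocates a new (1, 4, 8) tensor
expanded = bias_k.expand(-1, bsz, -1)    # (1, 4, 8) view over the same storage
```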
```python
self_attn_mask = self_attn_mask.unsqueeze(0)
attn_weights_main = attn_weights_main + self_attn_mask

attn_weights_main = softmax(
```
attn_probs_main would be nicer
```python
attn_result.append(attn_ngram)

# [1+ngram*T, B, C]
attn = torch.cat(attn_result, 0).view(-1, bsz, embed_dim)
```
It would be simpler and more readable to just do attn = torch.cat([attn_main, attn_ngram], 0).view(-1, bsz, embed_dim) and delete the 3 lines above IMO.
```python
def _get_input_buffer(self, incremental_state):
    return {}

def _set_input_buffer(self, incremental_state, buffer):
```
This function does not seem to do anything.
```python
def reorder_incremental_state(self, incremental_state, new_order):
    """Reorder buffered internal state (for incremental generation)."""
    input_buffer = self._get_input_buffer(incremental_state)
```
input_buffer = {} instead, or is _get_input_buffer not fully implemented yet?
```python
def forward(
    self,
    x,
```
hidden_states instead of x
```python
self.embed_positions = LearnedPositionalEmbedding(config.max_position_embeddings + 2 + self.padding_idx,
                                                  embed_dim, self.padding_idx)
self.ngram_input_embed = nn.Embedding(self.ngram, embed_dim, None)
self.layers = nn.ModuleList([])
```
You can do everything in one line IMO -> no need for self.layers = nn.ModuleList([])
```python
i_buckets_main_stream, i_bucket_relative_stream = \
    self.cal_finetune_relative_positions(real_positions)
predicting_stream_pos_embed = self.embed_positions._forward(real_positions + 1)
x = self.embed_tokens(input_ids)
```
x -> hidden_states
```python
next_cache = None
return x_list, next_cache, all_hidden_states, list(all_self_attns)
```
Delete empty lines
```python
padding_idx, vocab_size, dim_size = config.pad_token_id, config.vocab_size, config.d_model
self.embed_tokens = nn.Embedding(vocab_size, dim_size, padding_idx=padding_idx)
nn.init.normal_(self.embed_tokens.weight, mean=0, std=dim_size ** -0.5)
```
What is this doing? -> normally we handle the init in ProphetNetPreTrainedModel, such as here:
```python
base_model = ProphetNetModel(config)
self.model = base_model
# self.padding_idx = config.pad_token_id
self.padding_idx = -100
```
Why -100 and not config.pad_token_id? -100 is the value we usually use to disable the CE loss; not sure if this is intentionally set to this number here...
```python
    output_attentions=None,
    **unused,
):
    if "lm_labels" in unused:
```
You can delete the whole if "lm_labels" ... block -> since lm_labels was never there in the first place, we should not allow the user to use it.
```python
    lprobs,
    expend_targets.view(-1),
    reduction='sum',
    ignore_index=self.padding_idx,
```
Ah OK, I see where the -100 comes from. I think -100 is the default value for ignore_index, so you might not have to put it here. Also, can we call it ignore_loss_token_id instead?
```python
if labels is not None:
    # fine-tune
    expend_targets = labels.new_zeros(self.config.ngram, labels.size(0), labels.size(1)).fill_(self.padding_idx)
    for i in range(self.config.ngram):
```
Can we try not to use the for loop here? It should be quite easy to get rid of it, I think.
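A hedged sketch of one way to remove the loop, assuming each of the ngram streams is trained against the same (already shifted) labels tensor — an assumption about what the loop does, not the actual PR code:

```python
# If every stream receives the same labels, the Python loop over ngram can be
# replaced by a broadcasted expand (a view, no extra memory). Shapes are toy.
import torch

ngram, bsz, seq_len, pad = 2, 1, 4, -100
labels = torch.tensor([[3, 5, 7, 9]])

# loop version (as in the PR)
expend_targets = labels.new_zeros(ngram, bsz, seq_len).fill_(pad)
for i in range(ngram):
    expend_targets[i, :, :] = labels

# vectorized version
expend_targets_vec = labels.unsqueeze(0).expand(ngram, -1, -1)
```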
```python
return new_key_padding_mask
```
```python
class EncoderLayer(nn.Module):
```
I think we have to split the class up into more classes. If you take a look at BERT, you can see that the BertLayer is much more modularized.
Note that once we have the layer names defined, such as self.fc1, we cannot really change them afterwards because the trained weights are tied to them. Having a more modularized layer has the advantage that it's much easier and cleaner to add new features or change the logic at a later stage.
Say we want to add another version which has the layer norm before the feed-forward layers. As it is now, we would have to add if statements, which are not very clean. If we instead modularize the feed-forward part directly, such as self.feed_forward = FeedForwardLayer(...), we could later simply add another FeedForwardLayerNew(...) and do if config...: self.feed_forward = FeedForwardLayer(...) else: self.feed_forward = FeedForwardLayerNew(...). This allows us to simply derive many models from BERT and makes the model in general much more flexible. I strongly recommend doing this here as well... if this means that it does not fit with your current weights, we can simply write a conversion script to convert the weights (takes 15 min, I can help you :-))
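The modularization being suggested can be sketched roughly as follows (class names, sizes, and the omitted attention sub-module are illustrative assumptions, not the PR's layout):

```python
# Hedged sketch: split the feed-forward part of an encoder layer into its own
# module so that a variant (e.g. pre-layer-norm) could later be swapped in via
# the config without if statements scattered through forward().
import torch
import torch.nn as nn

class FeedForwardLayer(nn.Module):
    def __init__(self, hidden_size, intermediate_size, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, intermediate_size)
        self.fc2 = nn.Linear(intermediate_size, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, hidden_states):
        residual = hidden_states
        hidden_states = self.fc2(torch.relu(self.fc1(hidden_states)))
        # post-layer-norm variant: normalize after the residual connection
        return self.layer_norm(residual + self.dropout(hidden_states))

class EncoderLayer(nn.Module):
    def __init__(self, hidden_size=16, intermediate_size=32):
        super().__init__()
        # attention sub-module omitted for brevity
        self.feed_forward = FeedForwardLayer(hidden_size, intermediate_size)

    def forward(self, hidden_states):
        return self.feed_forward(hidden_states)

layer = EncoderLayer()
out = layer(torch.randn(2, 5, 16))
```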
```python
if self.bias_v is not None:
    nn.init.xavier_normal_(self.bias_v)


def _relative_positions_bucket(self, relative_positions, bidirectional=False):
```
I'd make this static as well and insert the self... params -> this is a classic function we should test.
Hey @qiweizhen,
Thanks so much for this PR! I read your paper today - congrats on the amazing results! I'm very excited to integrate this into the library.
Sorry that I left so many comments - feel free to only take into account those that you think are necessary - a lot of these comments are nits.
IMO we should focus on these 4 things before merging:
- Let's make the layers more flexible for future enhancements/changes by modularizing them more (see my comment on line 358). I think we should use Bert as the role model. Having ProphetNetSelfAttention, ProphetNetAttentionOutput, ProphetNetIntermediate, ProphetNetOutput would make the code much more flexible.
- We should try not to use Python and NumPy functions at all (Python's for loop, e.g.). Having for loops in the forward functions can render the code very slow on CUDA. Also @mfuntowicz here. This is not too big of a problem in the beginning, as fixing it does not require breaking changes later and can be improved in a second PR.
- More important: it would be great if we could change all functions that do not use self to static functions. It has two big advantages: 1) the reader knows the function does not require any model-inherited parameters and is probably quite easy; 2) the functions can very easily be tested -> the model does not have to be instantiated to test these functions, which makes it much easier to find bugs and have good tests.
- VERY IMPORTANT - tests! This model is very complex, and I think in order to be able to maintain it, we need integration tests, as is done a lot in Bart, e.g. in transformers/tests/test_modeling_bart.py. The more tests, the better... Without tests, it will be very hard to refactor the model at a later stage and to know whether it works correctly. Also, testing model-specific static functions like _relative_positions_bucket and single layers is extremely useful! First, people have a much easier time understanding these layers and functions by seeing an input and output, and second, bugs can be pinned down in much more detail. I started doing this a lot for Longformer now and think it's paying off a lot. If it requires too much work to do all these integration tests, no worry - we can do it later... we should have at least 4-5 integration tests for the pretrained ProphetNetForConditionalGeneration models.
The other comments are mostly nits. Sorry if all these comments are a bit overwhelming; if it sounds like too much work for you, don't worry - we can take care of it as well, especially things like the performance improvements of 2).
Thanks a lot for adding the model - very excited to have the community fine-tune it on a bunch of tasks :-)
@patrickvonplaten Thanks for your review! I learned a lot, too. @qiweizhen Please feel free to contact me for discussion via WeChat if you have trouble understanding Patrick's comments or want another person to double-check! Thanks for your great work!
Thanks for the new model!
- I would add tests and an integration test like in test_modeling_bart.py (maybe I missed these).
- In general, try to reuse components from Bart, which is also copied from fairseq and, besides your NGramAttention, is I think very similar. (Not a priority, only if it makes your life easier.)
(I meant to comment, not Approve.)
```python
EN_SENTENCE = "Google left China in 2010"
ZH_SENTENCE = "Google在2010年离开中国"
```
Long live Microsoft lol!
```python
class EncoderLayer(nn.Module):
    """
    Same to Transformer Encoder Layer
```
Suggested change: "Same to Transformer Encoder Layer" → "Same as Transformer Encoder Layer"
```python
# [1+ngram*T, B, C]
attn = torch.cat(attn_result, 0).view(-1, bsz, embed_dim)
attn = torch.cat([attn_main, attn_ngram], 0).view(-1, bsz, embed_dim)
```
Suggested change: attn = torch.cat([attn_main, attn_ngram], 0).view(-1, bsz, embed_dim) → attn = torch.cat((attn_main, attn_ngram), 0).view(-1, bsz, embed_dim)
```python
residual = x
x, _ = self.encoder_attn(
    query=x,
hidden_states = F.dropout(hidden_states, p=self.dropout, training=self.training)
```
I think you can make self.dropout = nn.Dropout(p=dropout) and then directly call it here?
```python
hidden_states = F.dropout(hidden_states, p=self.activation_dropout, training=self.training)
hidden_states = self.fc2(hidden_states)
hidden_states = F.dropout(hidden_states, p=self.dropout, training=self.training)
```
Same here for the dropouts.
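The suggestion can be illustrated with a small generic sketch (not PR code): an nn.Dropout module registered in __init__ tracks the module's training flag automatically, so the explicit training=self.training argument of F.dropout is no longer needed.

```python
# Hedged sketch: nn.Dropout (module form) vs F.dropout (functional form).
# In eval mode both become identity; the module form needs no explicit flag.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p
        self.dropout = nn.Dropout(p)  # follows .train()/.eval() automatically

    def forward(self, x):
        a = self.dropout(x)                                 # module form
        b = F.dropout(x, p=self.p, training=self.training)  # functional form
        return a, b

block = Block().eval()
x = torch.ones(2, 3)
a, b = block(x)  # in eval mode both are the identity
```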
```python
return result
```
```python
def cal_relative_positions_buckets(num_buckets, max_distance, real_positions):
```
I'm a little confused by the names _relative_positions_bucket and cal_relative_positions_buckets.
```python
return i_buckets_main_stream, i_bucket_relative_stream

def cal_finetune_relative_positions(self, real_positions):
def cal_and_buffer_finetune_relative_positions(self, real_positions):
```
Suggested change: def cal_and_buffer_finetune_relative_positions(self, real_positions): → def cal_and_cache_finetune_relative_positions(self, real_positions):
```python
ngram_input_embed = self.ngram_input_embed.weight
if use_cache:
    B = x.size(1)
    B = hidden_states.size(1)
```
Don't use capitalized letters for variable names.
```python
ngram_mask_matrix = self.buffered_future_mask_ngram(hidden_states)
# TODO in train [(1+ngram)*T, B, C], in inference [T+ngram, B, C]
x = torch.cat([x] + ngram_masks, 0)
hidden_states = torch.cat([hidden_states] + ngram_masks, 0)
```
Suggested change: hidden_states = torch.cat([hidden_states] + ngram_masks, 0) → hidden_states = torch.cat((hidden_states, *ngram_masks), 0)
Add new model structure ProphetNet.
Description:
ProphetNet is a new pre-trained language model for sequence-to-sequence learning with a novel self-supervised objective called future n-gram prediction. ProphetNet is able to predict more future tokens with an n-stream decoder. The original implementation is the Fairseq version at the GitHub repo.
xProphetNet has the same model structure, but is pretrained on a 100-language Wikipedia dataset as described in xGLUE. xGLUE is a benchmark for cross-lingual NLU and NLG tasks. xProphetNet also serves as a baseline model for the cross-lingual generation tasks in xGLUE, NTG and QG.
Usage:
Take the xGLUE NTG task as an example:
The cross-lingual pretrained model is fine-tuned with English news title generation data, but inference is done with both English and other, zero-shot, language data.
A quick usage is like:
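A minimal sketch of the intended usage, based on the generation snippet discussed in the review above (the checkpoint identifier is left as a placeholder, and the exact class names assume the classes introduced in this PR):

```python
from transformers import ProphetNetTokenizer, ProphetNetForConditionalGeneration

# "microsoft/xprophetnet-..." is a placeholder, not a final checkpoint name
tokenizer = ProphetNetTokenizer.from_pretrained("microsoft/xprophetnet-...")
model = ProphetNetForConditionalGeneration.from_pretrained("microsoft/xprophetnet-...")

EN_SENTENCE = "Google left China in 2010"
inputs = tokenizer([EN_SENTENCE], padding=True, max_length=256, return_tensors='pt')

# generate a news title (bos removal is handled inside the model)
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=100, early_stopping=True)
titles = [tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids]
```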
The model will generate news titles like:
Released checkpoints:
pretrained:
fine-tuned: