add new model prophetnet #6187
qiweizhen wants to merge 4 commits into huggingface:master from qiweizhen:master
Conversation
Thanks @qiweizhen! It's really exciting to see ProphetNet, since it's been a while since we've had a big language model integration like this!
You can add the link and a TL;DR of your paper in the README (and make sure to do so). Also, please write some unit tests to make sure the model works as expected (of course you can do that after we have done several rounds of reviews on the modeling and tokenizer code). Please consider adding a TensorFlow model once the PyTorch one is shipped. Also, please complete the model cards of your uploaded weights!
```python
def __new__(cls, **kwargs):
    xprophetnet_tokenizer = False if 'xprophetnet_tokenizer' not in kwargs.keys() else kwargs['xprophetnet_tokenizer']
    if xprophetnet_tokenizer:
        super_class = XLMRobertaTokenizer
    else:
        super_class = BertTokenizer
    cls = type(cls.__name__, (cls, super_class), {})
    if xprophetnet_tokenizer:
        cls.vocab_files_names = VOCAB_FILES_NAMES_CROSS_LINGUAL
    else:
        cls.vocab_files_names = VOCAB_FILES_NAMES_EN
    return super(ProphetNetTokenizer, cls).__new__(cls)
```
I personally think it's too hacky here. However, regarding the vocab_files_names issue, you can simply choose your own name, e.g., prophet.bpe or prophet.tokenizer. Be creative lol!
Then I think you can put vocabulary_type as a configuration option in the config! There should be no problem then.
Done as you suggested. Vocab files are now named prophetnet.tokenization.
```python
    **kwargs
):
    if not xprophetnet_tokenizer:
        # inherit from BERT tokenizer
```
I would use the wording copied from instead of inherit.
As I reviewed the code in tokenization_bart and other files, it seems inheriting is also acceptable. So, instead of copying every function and modifying it as self._tokenizer.func, I think this old code is simpler and avoids introducing bugs when copying the code.
Well, I think both ways are okay, but it's also good to hear from @LysandreJik.
```python
        )
        self.unique_no_split_tokens.append("[X_SEP]")
    else:
        # inherit from XLM-R tokenizer
```
```python
inputs = tokenizer([EN_SENTENCE_TO_QUESTION, RU_SENTENCE_TO_QUESTION, ZH_SENTENCE_TO_QUESTION], padding=True, max_length=256, return_tensors='pt')

# Generate Summary (bos is removed)
summary_ids = model.generate(inputs['input_ids'][:, 1:], num_beams=4, max_length=100, early_stopping=True)
```
My solution is to simply merge this into the model, like input = input[:, 1:]. Users shouldn't need to care about details like this.
Done as you suggested. Removing the bos token is now done inside the model.
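The change above can be illustrated with a small generic sketch (this is not the actual PR diff — the helper name `prepare_inputs` and the bos id are illustrative assumptions): the model drops the leading bos column once, internally, instead of every caller slicing it off.

```python
# Generic illustration of moving bos removal into the model: slice off the
# first (bos) column of input_ids inside the model instead of at every call
# site. prepare_inputs and bos_token_id=0 are hypothetical, not PR code.
import torch

input_ids = torch.tensor([[0, 11, 12, 13],   # 0 = bos
                          [0, 21, 22, 23]])

def prepare_inputs(input_ids, bos_token_id=0):
    # drop the leading bos column if present, so callers can pass raw input_ids
    if (input_ids[:, 0] == bos_token_id).all():
        input_ids = input_ids[:, 1:]
    return input_ids

prepared = prepare_inputs(input_ids)
```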
```python
from .modeling_prophetnet import (
    ProphetNetModel,
    ProphetNetForConditionalGeneration
)
```
I totally understand that ProphetNet is a specialized model for generation, but you may also want to add something like ProphetNetForSequenceClassification etc.? BART has them.
Thank you for this suggestion. I will add ProphetNetForSequenceClassification in the near future, once I find a suitable approach for NLU tasks. For example, I will compare the different encoder-decoder NLU heads of T5, BART...
Agree that it would be nice to also have ProphetNetForSequenceClassification, but I think we can handle this in a new PR. The most important one for now is probably ProphetNetForConditionalGeneration.
```python
(ElectraConfig, ElectraForMaskedLM),
(EncoderDecoderConfig, EncoderDecoderModel),
(ReformerConfig, ReformerModelWithLMHead),
(ProphetNetConfig, ProphetNetModel),
```
Is this correct? No need for WithLMHeadModel?
JetRunner left a comment
@sshleifer Is there anything else that needs to be done to make ProphetNet work with your seq2seq example?
Also @mfuntowicz for pipelines
I tried examples/seq2seq/finetune.py and it works with python finetune.py --do_train and --do_predict.

I will try to complete the documentation and unit tests this week.
```python
    eps=0.0,
    **common_kwargs
):
    if "hidden_size" in common_kwargs:
```
Do we have to call the parameter d_model? It would be great if we could just use hidden_size to avoid confusion. While a lot of previous models have d_model, we are now trying to be more consistent with naming.
```python
super().__init__(num_embeddings, embedding_dim, padding_idx)
self.onnx_trace = False

def forward(self, input, use_cache=False, positions=None):
```
We usually call input input_ids in all other models. It would be great if you could align the naming.
```python
    real_positions = positions
else:
    real_positions = positions
return super().forward(positions), real_positions
```
It's a bit confusing to me that two tensors are returned -> could you add a comment explaining why this is the case?
I think I would also prefer to split the class up into a function and a class:
- a function that calculates the "real_positions"
- a very simple nn.Embedding
I'm not really sure we need such a big class. IMO, one can calculate the real_positions via a function; then we would only need nn.Embedding, which is much more readable, and we wouldn't need _forward at all. max_positions is such a small function that we can just copy-paste it into the code.
But maybe I overlooked something - what do you think?
It's not that big of a deal though... we can also handle this in a refactor later.
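A minimal sketch of this suggestion (the function name `compute_real_positions` and the fairseq-style padding-offset convention are assumptions, not the PR's actual code): a standalone function computes the positions, and a plain nn.Embedding does the lookup.

```python
# Hedged sketch: compute "real" positions with a free function, then use a
# plain nn.Embedding instead of a large custom embedding class.
import torch
import torch.nn as nn

def compute_real_positions(input_ids, padding_idx):
    # Positions count non-padding tokens, offset by padding_idx + 1 so that
    # position ids never collide with the padding index (fairseq convention).
    mask = input_ids.ne(padding_idx).long()
    return torch.cumsum(mask, dim=1) * mask + padding_idx

embed_positions = nn.Embedding(512 + 2, 64, padding_idx=1)
input_ids = torch.tensor([[5, 6, 7, 1, 1]])  # 1 is the padding idx here
real_positions = compute_real_positions(input_ids, padding_idx=1)
position_embeds = embed_positions(real_positions)
```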
```python
def _forward(self, positions):
    return super().forward(positions)


def LayerNorm(normalized_shape, eps=1e-5, elementwise_affine=True, export=False):
```
This is very different from our usual design... an upper-case name is very unexpected for a function (even though the function instantiates a class...). I think a better way here is to do the following at the very top of the file:

```python
try:
    from apex.normalization import FusedLayerNorm
    ProphetNetLayerNorm = FusedLayerNorm
except ImportError:
    ProphetNetLayerNorm = torch.nn.LayerNorm
```

Then the layer norm can be instantiated normally with ProphetNetLayerNorm(normalized_shape, eps, elementwise_affine).
At the moment, export is always set to False when calling LayerNorm, and the other two parameters are also not used... should we put eps and elementwise_affine in the config? Do we need export? Or could we delete it?
Maybe importing LayerNorm from another model is a good idea?
like Bart, which does exactly this :)
```python
    return torch.nn.LayerNorm(normalized_shape, eps, elementwise_affine)


def invert_mask(attention_mask):
    assert attention_mask.dim() == 2
```
An assert message would be nice :-)

Yes, it can be as easy as assert attention_mask.dim() == 2, "some error message"
```python
    embed_dim,
    num_heads,
    dropout=0.0,
    bias=True,
```
Maybe change it to is_bias so everyone knows it's a flag.
Can we reuse Bart? This looks identical.
```python
self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
self.cache_key = "encoder_decoder" if self.encoder_decoder_attention else "self"
```
This looks like complicated logic in the following lines :D Maybe add a comment?
```python
self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
self.cache_key = "encoder_decoder" if self.encoder_decoder_attention else "self"

def _shape(self, tensor, dim_0, bsz):
```
Why is the function named _shape? -> it looks more like _reshape_for_...

And can we make it static for better readability?

Static functions are always less scary IMO :D
| """Input shape: Time(SeqLen) x Batch x Channel""" | ||
| static_kv: bool = self.encoder_decoder_attention | ||
| tgt_len, bsz, embed_dim = query.size() | ||
| assert embed_dim == self.embed_dim |
An assert message would be great.
```python
static_kv: bool = self.encoder_decoder_attention
tgt_len, bsz, embed_dim = query.size()
assert embed_dim == self.embed_dim
assert list(query.size()) == [tgt_len, bsz, embed_dim]
```
An assert message would be great.
```python
def forward(
    self,
    query,
```
I think query is actually the hidden_states here, if I'm not mistaken -> we usually call the input embedding representations hidden_states... by query I would think of the query projection, which could be a bit misleading.
```python
v = self._shape(v, -1, bsz)

if saved_state is not None:
    k, v, key_padding_mask = self._use_saved_state(k, v, saved_state, key_padding_mask, static_kv, bsz)
```
(nit) Can we try to avoid single-letter variables? They always make refactoring very difficult afterwards -> maybe just stick to key, value and query, and change the first key to hidden_states_key.

Well, I think k, v, q are acceptable, but there's no harm in using longer names.
| "prev_key_padding_mask": key_padding_mask if not static_kv else None, | ||
| } | ||
|
|
||
| assert k is not None |
An assert message would be great.
```python
assert k is not None
src_len = k.size(1)
attn_weights = torch.bmm(q, k.transpose(1, 2))
assert attn_weights.size() == (bsz * self.num_heads, tgt_len, src_len)
```
An assert message would be great.
```python
# This is part of a workaround to get around fork/join parallelism not supporting Optional types.
if key_padding_mask is not None and key_padding_mask.dim() == 0:
    key_padding_mask = None
assert key_padding_mask is None or key_padding_mask.size()[:2] == (bsz, src_len,)
```
An assert message would be great.
```python
if key_padding_mask is not None:  # don't attend to padding symbols
    attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
    reshaped = key_padding_mask.unsqueeze(1).unsqueeze(2)
```
(nit) key_padding_mask[:, None, None] is nicer IMO
```python
reshaped = key_padding_mask.unsqueeze(1).unsqueeze(2)
attn_weights = attn_weights.masked_fill(reshaped, float("-inf"))
attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
attn_weights = F.softmax(attn_weights, dim=-1)
```
-> attn_probs = F.softmax(attn_weights, dim=-1); after the softmax we have probabilities.
```python
attn_weights = F.softmax(attn_weights, dim=-1)
attn_probs = F.dropout(attn_weights, p=self.dropout, training=self.training,)

assert v is not None
```
An assert message is missing.
```python
if self.bias_v is not None:
    nn.init.xavier_normal_(self.bias_v)


def _relative_positions_bucket(self, relative_positions, bidirectional=False):
```
is_bidirectional might be a better name.
```python
# input attn_weights [T*head, T, S]
# input real_positions [B, T] or [1, 1]

T, B, _ = query.size()
```
Not a fan of single letters here... but I guess it's OK for now.
```python
else:
    saved_state = None
    layer_state = {}
```
If I read the function correctly, the arguments key and value are not used -> can we delete them from the function signature?
```python
    output_attentions=False
):

    tgt_len, bsz, embed_dim = query.size()
```
Can we rename query to hidden_states here as well?
```python
if self.bias_k is not None:
    assert self.bias_v is not None
    k = torch.cat([k, self.bias_k.repeat(1, bsz, 1)])
```
As a side note, it's always good to check whether expand works before using repeat -> repeat always allocates new memory, which can become heavy depending on how big the tensors are.
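This side note can be illustrated with a small generic example (not PR code): expand returns a stride-0 view over the original storage, while repeat materializes a full copy.

```python
# Hedged illustration of expand vs repeat: both produce the same values, but
# expand creates a view (stride 0 along the expanded dim, no new memory),
# while repeat allocates a new tensor.
import torch

bias_k = torch.randn(1, 1, 8)
bsz = 4

repeated = bias_k.repeat(1, bsz, 1)      # allocates a new (1, 4, 8) tensor
expanded = bias_k.expand(-1, bsz, -1)    # (1, 4, 8) view over the same storage
```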
```python
self_attn_mask = self_attn_mask.unsqueeze(0)
attn_weights_main = attn_weights_main + self_attn_mask

attn_weights_main = softmax(
```
attn_probs_main would be nicer
```python
attn_result.append(attn_ngram)

# [1+ngram*T, B, C]
attn = torch.cat(attn_result, 0).view(-1, bsz, embed_dim)
```
It would be simpler and more readable to just do attn = torch.cat([attn_main, attn_ngram], 0).view(-1, bsz, embed_dim) and delete the 3 lines above IMO.
```python
def _get_input_buffer(self, incremental_state):
    return {}

def _set_input_buffer(self, incremental_state, buffer):
```
This function does not seem to do anything.
```python
def reorder_incremental_state(self, incremental_state, new_order):
    """Reorder buffered internal state (for incremental generation)."""
    input_buffer = self._get_input_buffer(incremental_state)
```
input_buffer = {} instead, or is _get_input_buffer not fully implemented yet?
```python
def forward(
    self,
    x,
```
hidden_states instead of x
```python
self.embed_positions = LearnedPositionalEmbedding(config.max_position_embeddings + 2 + self.padding_idx,
                                                  embed_dim, self.padding_idx)
self.ngram_input_embed = nn.Embedding(self.ngram, embed_dim, None)
self.layers = nn.ModuleList([])
```
You can do everything in one line IMO -> no need for self.layers = nn.ModuleList([])
```python
i_buckets_main_stream, i_bucket_relative_stream = \
    self.cal_finetune_relative_positions(real_positions)
predicting_stream_pos_embed = self.embed_positions._forward(real_positions + 1)
x = self.embed_tokens(input_ids)
```
x -> hidden_states
```python
next_cache = None
return x_list, next_cache, all_hidden_states, list(all_self_attns)
```
Delete empty lines
```python
padding_idx, vocab_size, dim_size = config.pad_token_id, config.vocab_size, config.d_model
self.embed_tokens = nn.Embedding(vocab_size, dim_size, padding_idx=padding_idx)
nn.init.normal_(self.embed_tokens.weight, mean=0, std=dim_size ** -0.5)
```
What is this doing? -> normally we handle the init in ProphetNetPreTrainedModel, such as here:
```python
base_model = ProphetNetModel(config)
self.model = base_model
# self.padding_idx = config.pad_token_id
self.padding_idx = -100
```
Why -100 and not config.pad_token_id? -100 is the value we usually use to disable the CE loss; not sure if this is intentionally set to this number here...
```python
    output_attentions=None,
    **unused,
):
    if "lm_labels" in unused:
```
You can delete the whole if "lm_labels" ... block -> since lm_labels was never there in the first place, we should not allow the user to use it.
```python
    lprobs,
    expend_targets.view(-1),
    reduction='sum',
    ignore_index=self.padding_idx,
```
Ah OK, I see where the -100 comes from. I think -100 is the default value for ignore_index, so you might not have to put it here. Also, can we call it ignore_loss_token_id instead?
```python
if labels is not None:
    # fine-tune
    expend_targets = labels.new_zeros(self.config.ngram, labels.size(0), labels.size(1)).fill_(self.padding_idx)
    for i in range(self.config.ngram):
```
Can we try not to use the for loop here? It should be quite easy to get rid of it, I think.
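A hedged sketch of one way to remove the loop, assuming each of the ngram streams is trained against the same (already shifted) labels tensor — an assumption about what the loop does, not the actual PR code:

```python
# If every stream receives the same labels, the Python loop over ngram can be
# replaced by a broadcasted expand (a view, no extra memory). Shapes are toy.
import torch

ngram, bsz, seq_len, pad = 2, 1, 4, -100
labels = torch.tensor([[3, 5, 7, 9]])

# loop version (as in the PR)
expend_targets = labels.new_zeros(ngram, bsz, seq_len).fill_(pad)
for i in range(ngram):
    expend_targets[i, :, :] = labels

# vectorized version
expend_targets_vec = labels.unsqueeze(0).expand(ngram, -1, -1)
```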
```python
return new_key_padding_mask
```
```python
class EncoderLayer(nn.Module):
```
I think we have to split the class up into more classes. If you take a look at BERT, you can see that the BertLayer is much more modularized.
Note that once we have the layer names defined, such as self.fc1, we cannot really change them afterwards because the trained weights are tied to them. Having a more modularized layer has the advantage that it's much easier and cleaner to add new features or change the logic at a later stage.
Say we want to add another version which has the layer norm before the feed-forward layers. As it is now, we would have to add if statements, which are not very clean. If we instead modularize the feed-forward part directly, such as self.feed_forward = FeedForwardLayer(...), we could later simply add another FeedForwardLayerNew(...) and do if config...: self.feed_forward = FeedForwardLayer(...) else: self.feed_forward = FeedForwardLayerNew(...). This allows us to simply derive many models from BERT and makes the model in general much more flexible. I strongly recommend doing this here as well... if this means that it does not fit with your current weights, we can simply write a conversion script to convert the weights (takes 15 min, I can help you :-))
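The modularization being suggested can be sketched roughly as follows (class names, sizes, and the omitted attention sub-module are illustrative assumptions, not the PR's layout):

```python
# Hedged sketch: split the feed-forward part of an encoder layer into its own
# module so that a variant (e.g. pre-layer-norm) could later be swapped in via
# the config without if statements scattered through forward().
import torch
import torch.nn as nn

class FeedForwardLayer(nn.Module):
    def __init__(self, hidden_size, intermediate_size, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, intermediate_size)
        self.fc2 = nn.Linear(intermediate_size, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, hidden_states):
        residual = hidden_states
        hidden_states = self.fc2(torch.relu(self.fc1(hidden_states)))
        # post-layer-norm variant: normalize after the residual connection
        return self.layer_norm(residual + self.dropout(hidden_states))

class EncoderLayer(nn.Module):
    def __init__(self, hidden_size=16, intermediate_size=32):
        super().__init__()
        # attention sub-module omitted for brevity
        self.feed_forward = FeedForwardLayer(hidden_size, intermediate_size)

    def forward(self, hidden_states):
        return self.feed_forward(hidden_states)

layer = EncoderLayer()
out = layer(torch.randn(2, 5, 16))
```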
```python
if self.bias_v is not None:
    nn.init.xavier_normal_(self.bias_v)


def _relative_positions_bucket(self, relative_positions, bidirectional=False):
```
I'd make this static as well and insert the self... params -> this is a classic function we should test.
Hey @qiweizhen,
Thanks so much for this PR! I read your paper today - congrats on the amazing results! I'm very excited to integrate this into the library.
Sorry that I left so many comments - feel free to only take into account those that you think are necessary - a lot of these comments are nits.
IMO we should focus on these 4 things before merging:
- Let's make the layers more flexible for future enhancements/changes by modularizing them more (see my comment on line 358). I think we should use Bert as the role model. Having ProphetNetSelfAttention, ProphetNetAttentionOutput, ProphetNetIntermediate, ProphetNetOutput would make the code much more flexible.
- We should try not to use Python and NumPy functions at all (Python's for loop, e.g.). Having for loops in the forward functions can render the code very slow on CUDA. Also @mfuntowicz here. This is not too big of a problem in the beginning, as fixing it does not require breaking changes later and can be improved in a second PR.
- More important: it would be great if we could change all functions that do not use self to static functions. It has two big advantages: 1) the reader knows the function does not require any model-inherited parameters and is probably quite easy; 2) the functions can very easily be tested -> the model does not have to be instantiated to test these functions, which makes it much easier to find bugs and have good tests.
- VERY IMPORTANT - tests! This model is very complex, and I think in order to be able to maintain it, we need integration tests, as is done a lot in Bart, e.g. in transformers/tests/test_modeling_bart.py. The more tests, the better... Without tests, it will be very hard to refactor the model at a later stage and to know whether it works correctly. Also, testing model-specific static functions like _relative_positions_bucket and single layers is extremely useful! First, people have a much easier time understanding these layers and functions by seeing an input and output, and second, bugs can be pinned down in much more detail. I started doing this a lot for Longformer now and think it's paying off a lot. If it requires too much work to do all these integration tests, no worry - we can do it later... we should have at least 4-5 integration tests for the pretrained ProphetNetForConditionalGeneration models.
The other comments are mostly nits. Sorry if all these comments are a bit overwhelming; if it sounds like too much work for you, don't worry - we can take care of it as well, especially things like the performance improvements of 2).
Thanks a lot for adding the model - very excited to have the community fine-tune it on a bunch of tasks :-)
@patrickvonplaten Thanks for your review! I learned a lot, too. @qiweizhen Please feel free to contact me for discussion via WeChat if you have trouble understanding Patrick's comments or want another person to double-check! Thanks for your great work!
Thanks for the new model!
- I would add tests and an integration test like in test_modeling_bart.py (maybe I missed these).
- In general, try to reuse components from Bart, which is also copied from fairseq and, besides your NGramAttention, is I think very similar. (Not a priority, only if it makes your life easier.)
(I meant to comment, not Approve.)
```python
EN_SENTENCE = "Google left China in 2010"
ZH_SENTENCE = "Google在2010年离开中国"
```
Long live Microsoft lol!
```python
class EncoderLayer(nn.Module):
    """
    Same to Transformer Encoder Layer
```
Suggested change: "Same to Transformer Encoder Layer" → "Same as Transformer Encoder Layer"
```python
# [1+ngram*T, B, C]
attn = torch.cat(attn_result, 0).view(-1, bsz, embed_dim)
attn = torch.cat([attn_main, attn_ngram], 0).view(-1, bsz, embed_dim)
```
Suggested change: attn = torch.cat([attn_main, attn_ngram], 0).view(-1, bsz, embed_dim) → attn = torch.cat((attn_main, attn_ngram), 0).view(-1, bsz, embed_dim)
```python
residual = x
x, _ = self.encoder_attn(
    query=x,
hidden_states = F.dropout(hidden_states, p=self.dropout, training=self.training)
```
I think you can make self.dropout = nn.Dropout(p=dropout) and then directly call it here?
```python
hidden_states = F.dropout(hidden_states, p=self.activation_dropout, training=self.training)
hidden_states = self.fc2(hidden_states)
hidden_states = F.dropout(hidden_states, p=self.dropout, training=self.training)
```
Same here for the dropouts.
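The suggestion can be illustrated with a small generic sketch (not PR code): an nn.Dropout module registered in __init__ tracks the module's training flag automatically, so the explicit training=self.training argument of F.dropout is no longer needed.

```python
# Hedged sketch: nn.Dropout (module form) vs F.dropout (functional form).
# In eval mode both become identity; the module form needs no explicit flag.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p
        self.dropout = nn.Dropout(p)  # follows .train()/.eval() automatically

    def forward(self, x):
        a = self.dropout(x)                                 # module form
        b = F.dropout(x, p=self.p, training=self.training)  # functional form
        return a, b

block = Block().eval()
x = torch.ones(2, 3)
a, b = block(x)  # in eval mode both are the identity
```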
```python
return result
```
```python
def cal_relative_positions_buckets(num_buckets, max_distance, real_positions):
```
I'm a little confused by the names _relative_positions_bucket and cal_relative_positions_buckets.
```python
return i_buckets_main_stream, i_bucket_relative_stream

def cal_finetune_relative_positions(self, real_positions):
def cal_and_buffer_finetune_relative_positions(self, real_positions):
```
Suggested change: def cal_and_buffer_finetune_relative_positions(self, real_positions): → def cal_and_cache_finetune_relative_positions(self, real_positions):
```python
ngram_input_embed = self.ngram_input_embed.weight
if use_cache:
    B = x.size(1)
    B = hidden_states.size(1)
```
Don't use capitalized letters for variable names.
```python
ngram_mask_matrix = self.buffered_future_mask_ngram(hidden_states)
# TODO in train [(1+ngram)*T, B, C], in inference [T+ngram, B, C]
x = torch.cat([x] + ngram_masks, 0)
hidden_states = torch.cat([hidden_states] + ngram_masks, 0)
```
Suggested change: hidden_states = torch.cat([hidden_states] + ngram_masks, 0) → hidden_states = torch.cat((hidden_states, *ngram_masks), 0)
Add new model structure ProphetNet.
Description:
ProphetNet is a new pre-trained language model for sequence-to-sequence learning with a novel self-supervised objective called future n-gram prediction. ProphetNet is able to predict more future tokens with an n-stream decoder. The original implementation is the Fairseq version at the GitHub repo.
xProphetNet has the same model structure, but is pretrained on a 100-language Wikipedia dataset as described in xGLUE. xGLUE is a benchmark for cross-lingual NLU and NLG tasks. xProphetNet also serves as a baseline model for the cross-lingual generation tasks in xGLUE, NTG and QG.
Usage:
Take the xGLUE NTG task as an example:
The cross-lingual pretrained model is fine-tuned with English news title generation data, but inference is done with both English and other, zero-shot, language data.
A quick usage is like:
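A minimal sketch of the intended usage, based on the generation snippet discussed in the review above (the checkpoint identifier is left as a placeholder, and the exact class names assume the classes introduced in this PR):

```python
from transformers import ProphetNetTokenizer, ProphetNetForConditionalGeneration

# "microsoft/xprophetnet-..." is a placeholder, not a final checkpoint name
tokenizer = ProphetNetTokenizer.from_pretrained("microsoft/xprophetnet-...")
model = ProphetNetForConditionalGeneration.from_pretrained("microsoft/xprophetnet-...")

EN_SENTENCE = "Google left China in 2010"
inputs = tokenizer([EN_SENTENCE], padding=True, max_length=256, return_tensors='pt')

# generate a news title (bos removal is handled inside the model)
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=100, early_stopping=True)
titles = [tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids]
```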
The model will generate news titles like:
Released checkpoints:
pretrained:
fine-tuned: