ProphetNet #7157
patrickvonplaten merged 79 commits into huggingface:master from qiweizhen:prophetnet_develop
Conversation
prophetnet modified
modify codes as suggested v1
add prophetnet test files
I opened the wrong PR yesterday; please check this version, thanks!

@qiweizhen - this looks great! Is this the complete PR? Can we close the "old" PR #6187 in favor of this one?

@qiweizhen the integration tests look great! @JetRunner, I think we can take it from here :-) I saw that there are models, such as "xprophetnet-large-wiki100-cased-xglue-ntg", that exist both under microsoft and under weizhen - @qiweizhen, are these models identical?

This PR is the complete version, as I rebased this branch onto the latest huggingface version following @JetRunner's directions. The models under Microsoft are what we actually used; those under qiweizhen were only for debugging, and I will delete them. Thank you for your help @patrickvonplaten @JetRunner

Awesome! Thanks a million for your work! We will take it from here :-)

@patrickvonplaten Hi, may I ask when ProphetNet could be added to Transformers? Is there anything I can help with to get it integrated?

Hey @qiweizhen, sorry for the delay on this. ProphetNet is my no. 1 priority next week; it should be merged by the end of next week. You have done your part - I might ping you with some further questions.

@qiweizhen - the integration tests are awesome! Thanks to them, it should be quite straightforward to integrate the model.

@qiweizhen - would it be ok for you if we add a

Sure! Thank you!
sgugger
left a comment
Thanks for all the work in the implementation! I'm not a fan of breaking the naming conventions that are in all our modeling files, the building blocks should be prefixed with ProphetNet in my opinion. I'm also wondering why ProphetNetForCausalLM is excluded from the common tests.
The rest is just nits.
sshleifer
left a comment
Excited to use this! Great contribution!
I wrote comments as if I were reviewing Patrick's code. If anything is written without sufficient explanation or is unclear, I'd be happy to clarify.
I read the `config_`, `modeling_`, and test files.
Things I noticed in PyCharm that I didn't include here (all related to `modeling_prophetnet`):
- `softmax` ONNX trace logic: deleted in BART without issue, but no strong preference.
- `NgramMultiheadAttention.forward`: the `need_weights` kwarg is unused.
- `ProphetNetDecoderLayer`: the `output_attentions` kwarg is unused.
- Why is it called `predict_attention_mask` instead of `decoder_attention_mask`? I think `main` is used instead of "encoder" also?
- There are two sets of logic for preparing causal masks: `prepare_attention_mask` and `prepare_predict_attention_mask`. I think these should both have docstrings/better names. I don't understand their role well enough to know exactly.
- In `prepare_predict_attention_mask`, are we assuming that batches are padded to `max_target_positions`?
- In `prepare_predict_attention_mask`, why do we expand `predict_causal_mask` to `max_target_positions`?
- I would type hint that `DecoderLayer` returns `Tuple`.
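The causal-mask helpers under discussion can be illustrated abstractly. This is a pure-Python sketch of a standard causal mask (not the PR's implementation, which also handles padding and the n-gram prediction streams): position i may attend only to positions <= i.

```python
# Illustrative sketch only, not the code from this PR.
def causal_mask(seq_len):
    """True where attention is allowed: position i sees positions <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(3)
assert mask == [
    [True, False, False],
    [True, True, False],
    [True, True, True],
]
```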
```python
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=100, return_tensors='pt')

# Generate Summary
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=512, early_stopping=True)
```
If `num_beams=4, max_length=512` are config defaults (512 seems high), they should not be specified.
If 512 is meant to be the source `max_length`, as I suspect, `tokenizer.model_max_length` should be set so this is handled by default.
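The fallback behavior being suggested can be sketched abstractly (illustrative pure Python, not the actual transformers `generate` implementation):

```python
# Illustrative sketch: generation kwargs fall back to config defaults,
# so call sites need not repeat values already stored on the config.
class GenConfig:
    num_beams = 4
    max_length = 512

def generate(config, **kwargs):
    num_beams = kwargs.get("num_beams", config.num_beams)
    max_length = kwargs.get("max_length", config.max_length)
    return num_beams, max_length

assert generate(GenConfig()) == (4, 512)               # defaults apply
assert generate(GenConfig(), num_beams=1) == (1, 512)  # explicit override wins
```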
The integration test was written by the author, so I'd prefer to leave it as is, to ensure the model behaves as originally expected by the author.
For xGLUE cross-lingual NLG tasks, xProphetNet is finetuned with English data, but inference is done with both English and other zero-shot-language data.

### Usage
A quick usage example:
The same comments as above apply to all model cards.
```python
    "microsoft/xprophetnet-large-wiki100-cased-xglue-ntg", use_cdn=False
)
model.to(torch_device)
model.config.max_length = 512
```
But you generate only about 30 tokens?
```python
@slow
def test_xprophetnet_ntg_inference(self):
    model = XLMProphetNetForConditionalGeneration.from_pretrained(
        "microsoft/xprophetnet-large-wiki100-cased-xglue-ntg", use_cdn=False
    )

    summary_ids_beam1 = model.generate(
        input_ids, num_beams=1, length_penalty=1.0, no_repeat_ngram_size=3, early_stopping=True
```
Suggested change:

```diff
-    input_ids, num_beams=1, length_penalty=1.0, no_repeat_ngram_size=3, early_stopping=True
+    input_ids, num_beams=1,
```

(assuming config defaults like BART)
```python
def test_is_whitespace(self):
    self.assertTrue(_is_whitespace(" "))
    self.assertTrue(_is_whitespace("\t"))
    self.assertTrue(_is_whitespace("\r"))
    self.assertTrue(_is_whitespace("\n"))
    self.assertTrue(_is_whitespace("\u00A0"))

    self.assertFalse(_is_whitespace("A"))
    self.assertFalse(_is_whitespace("-"))

def test_is_control(self):
    self.assertTrue(_is_control("\u0005"))

    self.assertFalse(_is_control("A"))
    self.assertFalse(_is_control(" "))
    self.assertFalse(_is_control("\t"))
    self.assertFalse(_is_control("\r"))

def test_is_punctuation(self):
    self.assertTrue(_is_punctuation("-"))
    self.assertTrue(_is_punctuation("$"))
    self.assertTrue(_is_punctuation("`"))
    self.assertTrue(_is_punctuation("."))

    self.assertFalse(_is_punctuation("A"))
    self.assertFalse(_is_punctuation(" "))
```
…nsformers into prophetnet_develop
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
…nsformers into prophetnet_develop
sgugger
left a comment
The master will be removed by the release master but should be there until then ;-)
LysandreJik
left a comment
Complicated model! Great job on the implementation and finishing touches!
Mostly nits about logging. Should wait for #7659 to be merged before merging.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Add ProphetNet.
This PR implements both ProphetNet and XLM-ProphetNet. The model architectures are identical, but each model uses a different tokenizer.
Description:
ProphetNet is a new pre-trained language model for sequence-to-sequence learning with a novel self-supervised objective called future n-gram prediction: it predicts several future tokens at once using an n-stream decoder. The original implementation is the Fairseq version in the original GitHub repo.
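As an illustration of the objective (a hedged sketch, not this PR's code): with future n-gram prediction for n = 2, each position is trained to predict the next two tokens, one per decoder stream, rather than only the next token as in standard causal LM training.

```python
# Illustrative sketch only: build future n-gram prediction targets.
# Near the sequence end, fewer than n future tokens remain.
def ngram_targets(tokens, n=2):
    return [tuple(tokens[t + 1 : t + 1 + n]) for t in range(len(tokens))]

assert ngram_targets([1, 2, 3, 4]) == [(2, 3), (3, 4), (4,), ()]
```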
xProphetNet shares the same model structure but is pretrained on the 100-language Wikipedia dataset described in xGLUE, a benchmark for cross-lingual NLU and NLG tasks. xProphetNet also serves as the baseline model for the cross-lingual generation tasks in xGLUE (NTG and QG).
Usage:
Take xGLUE NTG task as an example:
The cross-lingual pretrained model is finetuned with English news-title-generation data, but inference is performed on both English data and zero-shot data in other languages.
A quick usage example:
The model will generate news titles like:
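The usage snippet itself did not survive extraction; the following is a hedged reconstruction based on the integration test quoted earlier in this thread (checkpoint name from the PR; running this downloads a large model, and generation arguments are illustrative):

```python
def generate_title(article: str) -> str:
    # Hedged sketch based on the PR's integration test; not verbatim from the PR.
    from transformers import (
        XLMProphetNetForConditionalGeneration,
        XLMProphetNetTokenizer,
    )

    name = "microsoft/xprophetnet-large-wiki100-cased-xglue-ntg"
    tokenizer = XLMProphetNetTokenizer.from_pretrained(name)
    model = XLMProphetNetForConditionalGeneration.from_pretrained(name)

    inputs = tokenizer([article], return_tensors="pt")
    ids = model.generate(
        inputs["input_ids"], num_beams=4, no_repeat_ngram_size=3, early_stopping=True
    )
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```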
Released checkpoints:
pretrained:
fine-tuned:
Notes
The integration tests for ProphetNet check against the outputs of the original Fairseq implementation and include:
The model was implemented so that all of its parts can be used separately. This means that `ProphetNetEncoder` and `ProphetNetDecoder` can be used as stand-alone models, and `ProphetNetForCausalLM` can be instantiated easily from pretrained checkpoints and used within the EncoderDecoderModel framework.
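For example, the stand-alone causal LM can in principle be paired with any encoder via the generic `EncoderDecoderModel`; a hedged sketch (checkpoint names are illustrative, and running this downloads the checkpoints):

```python
def build_warm_started_seq2seq():
    # Hedged sketch: warm-start a seq2seq model from two pretrained checkpoints.
    # The decoder checkpoint is loaded as a causal LM with cross-attention.
    from transformers import EncoderDecoderModel

    return EncoderDecoderModel.from_encoder_decoder_pretrained(
        "bert-base-uncased", "microsoft/prophetnet-large-uncased"
    )
```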