Add RoBERTa question answering & Update SQuAD runner to support RoBERTa #1386
stevezheng23 wants to merge 12 commits into huggingface:master from stevezheng23:dev/zheng/roberta
Conversation
@thomwolf / @LysandreJik / @VictorSanh / @julien-c Could you help review this PR? Thanks!
Codecov Report
@@ Coverage Diff @@
## master #1386 +/- ##
==========================================
- Coverage 86.16% 86.01% -0.16%
==========================================
Files 91 91
Lines 13593 13626 +33
==========================================
+ Hits 11713 11720 +7
- Misses 1880 1906 +26
Continue to review full report at Codecov.
Hi @thomwolf / @LysandreJik / @VictorSanh / @julien-c, I have also run experiments using the RoBERTa-large setting from the original paper and reproduced their results.
Awesome @stevezheng23. Can I push on top of your PR to change a few things before we merge? (We refactored the tokenizer to handle the encoding of sequence pairs, including special tokens, so we no longer need to do it inside each example script.)
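The refactoring mentioned above means the tokenizer itself produces the model-specific special-token layout for sequence pairs. A minimal sketch of the two layouts involved in this PR (illustrative only, not the library's implementation; the helper name is made up):

```python
def build_pair_with_special_tokens(tokens_a, tokens_b, model_type="roberta"):
    """Sketch of the special-token layouts a sequence-pair encoder
    must produce (hypothetical helper, for illustration only)."""
    if model_type == "bert":
        # BERT pair format: [CLS] A [SEP] B [SEP]
        return ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # RoBERTa pair format: <s> A </s></s> B </s>
    return ["<s>"] + tokens_a + ["</s>", "</s>"] + tokens_b + ["</s>"]
```

This is why doing the concatenation by hand in each example script was error-prone: the layouts differ per model.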
@julien-c sure, please add changes in this PR if needed 👍
@julien-c I've also uploaded the RoBERTa-large model fine-tuned on SQuAD v2.0 data, together with its prediction & evaluation results, to public cloud storage: https://storage.googleapis.com/mrc_data/squad/roberta.large.squad.v2.zip
Can you check my latest commit @stevezheng23? Main change is that I removed the […]. @thomwolf @LysandreJik this is ready for review.
Everything looks good. As for the […]
Great! Good job on reimplementing the cross-entropy loss when start/end positions are given. |
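For reference, the SQuAD-style objective discussed here averages a cross-entropy over the start index and one over the end index of the answer span. A self-contained sketch in plain Python (hypothetical helper names, not the PR's actual code):

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target index,
    computed with the usual max-shift for numerical stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def span_loss(start_logits, end_logits, start_pos, end_pos):
    """SQuAD-style QA loss: average of start and end cross-entropies."""
    return 0.5 * (cross_entropy(start_logits, start_pos)
                  + cross_entropy(end_logits, end_pos))
```

When the model is confident and correct (e.g. a large logit at the true index), the loss approaches zero; with uniform logits over two positions it equals log 2.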
examples/run_squad.py
Outdated
  ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) \
-                   for conf in (BertConfig, XLNetConfig, XLMConfig)), ())
+                   for conf in (BertConfig, RobertaConfig, XLNetConfig, XLMConfig)), ())
Do we need to add DistilBertConfig here?
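The snippet above flattens each config class's `pretrained_config_archive_map` keys into one tuple of model names via `sum(..., ())`. A minimal sketch of the same pattern with stand-in dicts (the maps and their entries here are made-up placeholders, not the library's real archive maps):

```python
# Stand-in archive maps: checkpoint name -> download URL (placeholders).
bert_map = {"bert-base-uncased": "url1", "bert-large-uncased": "url2"}
roberta_map = {"roberta-base": "url3", "roberta-large": "url4"}

# sum() with an empty-tuple start concatenates the per-map key tuples
# into one flat tuple of all supported checkpoint names.
ALL_MODELS = sum((tuple(m.keys()) for m in (bert_map, roberta_map)), ())
```

Adding a model type (here the reviewer's DistilBertConfig question) is just a matter of adding its config class to the inner tuple.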
+     query_tokens = tokenizer.tokenize(example.question_text, add_prefix_space=True)
+ else:
+     query_tokens = tokenizer.tokenize(example.question_text)
- query_tokens = tokenizer.tokenize(example.question_text)
I also observed an improvement with add_prefix_space=True when I used RoBERTa.
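add_prefix_space matters because RoBERTa's byte-level BPE folds a leading space into the first token, so a word tokenized at the start of a string gets different ids than the same word mid-sentence. A tiny sketch of the preprocessing step (hypothetical helper, not the tokenizer's real code):

```python
def maybe_add_prefix_space(text, add_prefix_space=False):
    """Mimics what add_prefix_space=True does before byte-level BPE
    runs: prepend a space so the first word matches its usual
    space-prefixed training-time form (illustrative sketch only)."""
    if add_prefix_space and text and not text.startswith(" "):
        return " " + text
    return text
```

With the prefix space, the first word of a question is segmented the same way it would be inside running text, which is the likely source of the observed improvement.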
Merge changes from huggingface/transformers to stevezheng23/transformers
Looks good to me.
@julien-c do you want to add the RoBERTa model fine-tuned on SQuAD by @stevezheng23 to our library?
Yep @thomwolf
merge from huggingface/transformers master branch
@thomwolf I have updated the README file as you suggested; you can merge this PR when you think it's good to go. BTW, it seems the CI build is broken.
Ok thanks, I'll let @julien-c finish handling this PR when he's back.
Hey @stevezheng23! I just tried to reproduce your model with slightly different hyperparameters ([…]).

Results with your model: […]

Results with the model I trained, on the best checkpoint I was able to obtain after training for 8 epochs: […]

Your hyperparameters: […]

My hyperparameters: […]

Do you have any ideas why this is happening? One thing that may be happening is that, when using […]. I'd like to see what happens without the need of gradient accumulation - anyone with a spare TPU to share? 😬
@pminervini I haven't tried out using […]
@stevezheng23 if you look at it, the […] https://github.com/huggingface/transformers/blob/master/examples/run_squad.py#L163 @thomwolf what do you think? Should I go and do a PR?
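The training loop being linked steps the optimizer only once every gradient_accumulation_steps micro-batches, so a trailing remainder of micro-batches accumulates gradients that never reach an optimizer step. A minimal sketch of that counting behavior (illustrative, not the script's actual code):

```python
def optimizer_updates(num_batches, accumulation_steps):
    """Count optimizer steps in a loop that only calls optimizer.step()
    every `accumulation_steps` micro-batches; a trailing remainder
    smaller than accumulation_steps never triggers a step."""
    updates = 0
    for step in range(num_batches):
        # loss.backward() would accumulate gradients here on every batch
        if (step + 1) % accumulation_steps == 0:
            updates += 1  # optimizer.step(); optimizer.zero_grad()
    return updates
```

For example, 10 micro-batches with accumulation_steps=3 yield only 3 updates, and the last batch's gradients are silently dropped at epoch end.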
from transformers import (WEIGHTS_NAME, BertConfig, BertForQuestionAnswering, BertTokenizer,
                          […]
                          XLNetForQuestionAnswering,
                          XLNetTokenizer,
                          DistilBertConfig, DistilBertForQuestionAnswering, DistilBertTokenizer)
Can you add RoBERTa to the title ("Finetuning the library models for question-answering...")?
#### Fine-tuning RoBERTa on SQuAD
This is an example using 4-GPU distributed training to fine-tune the RoBERTa-large model on the SQuAD v2.0 dataset:
Can you add the GPU model, so it's easy to tune per_gpu_eval_batch_size relative to the memory size of the GPU used in the example?
@LysandreJik just significantly rewrote our SQuAD integration in #1984, so we were holding off on merging this. Does anyone here want to revisit this PR with the changes from #1984? Otherwise, we'll do it, time permitting.
Cool, I'm willing to revisit it. I will take a look at your changes and transformers' recent updates today (I have been away from the master branch for some time 😊).
You're using num_train_epochs=8 instead of 2, which makes the learning rate decay more slowly. Maybe that is causing the difference?
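Under a linear decay schedule whose horizon t_total scales with num_train_epochs, the same optimization step sees a higher learning rate in a longer run. A toy sketch of that effect (warmup omitted; not the actual scheduler implementation):

```python
def linear_decay_lr(step, t_total, base_lr=3e-5):
    """Learning rate under linear decay from base_lr to zero
    over t_total total steps (warmup phase omitted for brevity)."""
    return base_lr * max(0.0, 1.0 - step / t_total)
```

With 100 steps per epoch, at step 100 an 8-epoch run (t_total=800) is still at 87.5% of base_lr, while a 2-epoch run (t_total=200) has already decayed to 50%, so checkpoints at the same step are trained quite differently.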
Regarding […], RoBERTa also uses […]
Hi @stevezheng23 @julien-c @thomwolf @ethanjperez, I updated the SQuAD runner with RoBERTa support in #2173.
Closed in favor of #2173, which should be merged soon.