
Add RoBERTa question answering & Update SQuAD runner to support RoBERTa #1386

Closed

stevezheng23 wants to merge 12 commits into huggingface:master from stevezheng23:dev/zheng/roberta

Conversation

@stevezheng23

No description provided.

@stevezheng23
Author

stevezheng23 commented Sep 30, 2019

@thomwolf / @LysandreJik / @VictorSanh / @julien-c Could you help review this PR? Thanks!

@codecov-io

codecov-io commented Oct 2, 2019

Codecov Report

Merging #1386 into master will decrease coverage by 0.15%.
The diff coverage is 21.21%.


@@            Coverage Diff             @@
##           master    #1386      +/-   ##
==========================================
- Coverage   86.16%   86.01%   -0.16%     
==========================================
  Files          91       91              
  Lines       13593    13626      +33     
==========================================
+ Hits        11713    11720       +7     
- Misses       1880     1906      +26
Impacted Files Coverage Δ
transformers/modeling_roberta.py 69.18% <21.21%> (-11.39%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update be916cb...ee83f98.

@stevezheng23
Author

Hi @thomwolf / @LysandreJik / @VictorSanh / @julien-c

I have also run experiments using the RoBERTa-large setting from the original paper and reproduced their results:

  • SQuAD v1.1
    {
    "exact": 88.25922421948913,
    "f1": 94.43790487416292,
    "total": 10570,
    "HasAns_exact": 88.25922421948913,
    "HasAns_f1": 94.43790487416292,
    "HasAns_total": 10570
    }
  • SQuAD v2.0
    {
    "exact": 86.05238777057188,
    "f1": 88.99602665148535,
    "total": 11873,
    "HasAns_exact": 83.38394062078272,
    "HasAns_f1": 89.27965999208608,
    "HasAns_total": 5928,
    "NoAns_exact": 88.71320437342304,
    "NoAns_f1": 88.71320437342304,
    "NoAns_total": 5945,
    "best_exact": 86.5914259243662,
    "best_exact_thresh": -2.146007537841797,
    "best_f1": 89.43104312625539,
    "best_f1_thresh": -2.146007537841797
    }
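The best_exact_thresh / best_f1_thresh fields come from the SQuAD v2.0 evaluator, which sweeps a threshold on the null (no-answer) score difference. A minimal sketch of how such a threshold is applied at prediction time (function and variable names here are illustrative, not the evaluator's actual code):

```python
def apply_null_threshold(null_score, best_span_score, best_span_text, thresh):
    """Predict no-answer when the null score beats the best span score
    by more than `thresh` (SQuAD v2.0 convention: score_diff > thresh)."""
    score_diff = null_score - best_span_score
    return "" if score_diff > thresh else best_span_text

# With a threshold near the reported -2.146, a null score slightly below the
# best span score still produces a no-answer prediction.
print(apply_null_threshold(-1.0, 0.5, "Denver", -2.146))   # no answer ("")
print(apply_null_threshold(-5.0, 0.5, "Denver", -2.146))   # "Denver"
```

A negative best threshold, as reported above, means the evaluation favors biasing the model slightly toward predicting no-answer.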

@julien-c
Member

julien-c commented Oct 3, 2019

Awesome @stevezheng23. Can I push on top of your PR to change a few things before we merge?

(We refactored the tokenizer to handle the encoding of sequence pairs, including special tokens. So we don't need to do it inside each example script anymore)
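For context, the refactored tokenizers build the pair encoding with model-specific special tokens. The helper below is a toy illustration of the two token layouts involved here (not the library API), assuming the standard BERT and RoBERTa special tokens:

```python
def build_pair(question_tokens, context_tokens, model_type):
    """Toy sketch of sequence-pair layouts with special tokens."""
    if model_type == "bert":
        # [CLS] question [SEP] context [SEP]
        return ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    if model_type == "roberta":
        # <s> question </s></s> context </s>
        return ["<s>"] + question_tokens + ["</s>", "</s>"] + context_tokens + ["</s>"]
    raise ValueError(f"unknown model_type: {model_type}")

print(build_pair(["who", "?"], ["He", "did", "."], "roberta"))
```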

@stevezheng23
Author

@julien-c sure, please add changes in this PR if needed 👍

@stevezheng23
Author

stevezheng23 commented Oct 3, 2019

@julien-c I've also uploaded the RoBERTa-large model fine-tuned on SQuAD v2.0 data, together with its prediction & evaluation results, to public cloud storage: https://storage.googleapis.com/mrc_data/squad/roberta.large.squad.v2.zip

@stevezheng23 stevezheng23 changed the title add roberta qa support & update squad runner/util Add RoBERTa question answering & Update SQuAD runner to support RoBERTa Oct 3, 2019
@julien-c
Member

julien-c commented Oct 3, 2019

Can you check my latest commit @stevezheng23? The main change is that I removed add_prefix_space for RoBERTa (which the RoBERTa authors don't use, as far as I know); it doesn't seem to make a significant difference.

@thomwolf @LysandreJik this is ready for review.

@julien-c julien-c mentioned this pull request Oct 3, 2019
@stevezheng23
Author

Everything looks good.

As for the add_prefix_space flag:

  • With add_prefix_space=True, the F1 score is around 89.4
  • With add_prefix_space=False, the F1 score is around 88.2
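The flag matters because RoBERTa's byte-level BPE encodes a preceding space into the token itself (the "Ġ" marker), so a question tokenized at string start gets different ids than the same words mid-sentence. A toy sketch of the mechanism (the marker convention is real; the tokenizer below is not):

```python
def toy_tokenize(text, add_prefix_space=False):
    """Whitespace-level sketch of byte-level BPE's space handling:
    every word preceded by a space is marked with 'Ġ'."""
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    words = [w for w in text.split(" ") if w]
    marked = []
    for i, w in enumerate(words):
        # Only the first word of a string with no leading space goes unmarked.
        if i == 0 and not text.startswith(" "):
            marked.append(w)
        else:
            marked.append("Ġ" + w)
    return marked

print(toy_tokenize("who wrote it"))                         # ['who', 'Ġwrote', 'Ġit']
print(toy_tokenize("who wrote it", add_prefix_space=True))  # ['Ġwho', 'Ġwrote', 'Ġit']
```

With add_prefix_space=True, the question's first word is tokenized the way it usually appears in pretraining text (after a space), which is one plausible reason for the F1 gap reported here.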

@LysandreJik
Member

Great! Good job on reimplementing the cross-entropy loss when start/end positions are given.
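For reference, the standard span-extraction loss averages cross-entropy over the start and end logits, clamping out-of-range gold positions. A self-contained sketch of that computation (not the PR's exact code):

```python
import math

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def qa_span_loss(start_logits, end_logits, start_position, end_position):
    """Average of start- and end-position cross-entropy. Gold positions
    outside the sequence are clamped into range (the modeling code uses
    an ignored_index for the same purpose)."""
    last = len(start_logits) - 1
    start_position = min(max(start_position, 0), last)
    end_position = min(max(end_position, 0), last)
    return 0.5 * (cross_entropy(start_logits, start_position)
                  + cross_entropy(end_logits, end_position))
```

With uniform logits over 4 positions, both terms equal log 4, so the loss is log 4 ≈ 1.386.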


ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) \
    for conf in (BertConfig, RobertaConfig, XLNetConfig, XLMConfig)), ())
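The sum(..., ()) idiom concatenates the per-config key tuples into one flat tuple, starting from the empty tuple. A minimal sketch with illustrative stand-in model names (not the real archive maps):

```python
# Stand-ins for each config class's pretrained_config_archive_map.
archive_maps = {
    "bert": ("bert-base-uncased", "bert-large-uncased"),
    "roberta": ("roberta-base", "roberta-large"),
}

# sum(iterable_of_tuples, ()) repeatedly applies tuple +, flattening them.
ALL_MODELS = sum((tuple(names) for names in archive_maps.values()), ())
print(ALL_MODELS)
```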
Contributor


Do we need to add DistilBertConfig here?

Author


added

    query_tokens = tokenizer.tokenize(example.question_text, add_prefix_space=True)
else:
    query_tokens = tokenizer.tokenize(example.question_text)
Contributor

erenup commented Oct 5, 2019


I also observed an improvement with add_prefix_space=True when I used roberta

@thomwolf
Member

thomwolf commented Oct 9, 2019

Looks good to me.
We'll probably be able to simplify utils_squad a lot soon but that will be fine for now.
Do you want to add your experimental results with RoBERTa in examples/readme, with a recommendation to use add_prefix_space=True (fyi it's the opposite for NER)?

@thomwolf
Member

thomwolf commented Oct 9, 2019

@julien-c do you want to add the roberta model finetuned on squad by @stevezheng23 in our library?

@julien-c
Member

julien-c commented Oct 9, 2019

Yep @thomwolf

@stevezheng23
Author

@thomwolf I have updated the README file as you suggested; you can merge this PR when you think it's good to go. BTW, it seems the CI build is broken.

@thomwolf
Member

Ok thanks, I'll let @julien-c finish handling this PR when he's back.

@pminervini
Contributor

pminervini commented Oct 16, 2019

@julien-c I've also uploaded the RoBERTa-large model fine-tuned on SQuAD v2.0 data, together with its prediction & evaluation results, to public cloud storage: https://storage.googleapis.com/mrc_data/squad/roberta.large.squad.v2.zip

Hey @stevezheng23 !

I just tried to reproduce your model with slightly different hyperparameters (batch_size=2 and gradient_accumulation=6 instead of batch_size=12), and I am currently getting worse results.

Results with your model:

{
  "exact": 86.05238777057188,
  "f1": 88.99602665148535,
  "total": 11873,
  "HasAns_exact": 83.38394062078272,
  "HasAns_f1": 89.27965999208608,
  "HasAns_total": 5928,
  "NoAns_exact": 88.71320437342304,
  "NoAns_f1": 88.71320437342304,
  "NoAns_total": 5945
}

Results with the model I trained, on the best checkpoint I was able to obtain after training for 8 epochs:

{
  "exact": 82.85184873241809,
  "f1": 85.85477834702593,
  "total": 11873,
  "HasAns_exact": 77.80026990553306,
  "HasAns_f1": 83.8147407750069,
  "HasAns_total": 5928,
  "NoAns_exact": 87.88898233809924,
  "NoAns_f1": 87.88898233809924,
  "NoAns_total": 5945
}

Your hyperparameters:

Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', device=device(type='cuda', index=0), do_eval=True, do_lower_case=False, do_train=True, doc_stride=128, eval_all_checkpoints=False, evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=1.5e-05, local_rank=0, logging_steps=50, max_answer_length=30, max_grad_norm=1.0, max_query_length=64, max_seq_length=512, max_steps=-1, model_name_or_path='roberta-large', model_type='roberta', n_best_size=20, n_gpu=1, no_cuda=False, null_score_diff_threshold=0.0, num_train_epochs=2.0, output_dir='output/squad/v2.0/roberta.large', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=12, per_gpu_train_batch_size=12, predict_file='data/squad/v2.0/dev-v2.0.json', save_steps=500, seed=42, server_ip='', server_port='', tokenizer_name='', train_batch_size=12, train_file='data/squad/v2.0/train-v2.0.json', verbose_logging=False, version_2_with_negative=True, warmup_steps=500, weight_decay=0.01)

My hyperparameters:

Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', device=device(type='cuda'), do_eval=True, do_lower_case=False, do_train=True, doc_stride=128, eval_all_checkpoints=False, evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=6, learning_rate=1.5e-05, local_rank=-1, logging_steps=50, max_answer_length=30, max_grad_norm=1.0, max_query_length=64, max_seq_length=512, max_steps=-1, model_name_or_path='roberta-large', model_type='roberta', n_best_size=20, n_gpu=1, no_cuda=False, null_score_diff_threshold=0.0, num_train_epochs=8.0, output_dir='../roberta.large.squad2.v1p', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=2, per_gpu_train_batch_size=2, predict_file='/home/testing/drive/invariance//workspace/data/squad/dev-v2.0.json', save_steps=500, seed=42, server_ip='', server_port='', tokenizer_name='', train_batch_size=2, train_file='/home/testing/drive/invariance//workspace/data/squad/train-v2.0.json', verbose_logging=False, version_2_with_negative=True, warmup_steps=500, weight_decay=0.01)

Do you have any ideas why this is happening?

One thing that may be happening is that, when using max_grad_norm with gradient_accumulation=n, gradient-norm clipping seems to be applied n times rather than once, but I need to look deeper into this.

I'd like to see what happens without the need of gradient accumulation - anyone with a spare TPU to share? 😬

@stevezheng23
Author

stevezheng23 commented Oct 16, 2019

Ok thanks, I'll let @julien-c finish handling this PR when he's back.

thanks, @thomwolf

@stevezheng23
Author

@pminervini I haven't tried the max_grad_norm and gradient_accumulation=n combination before. One thing to pay attention to is that the checkpoint was trained with add_prefix_space=True for the RoBERTa tokenizer.

@pminervini
Contributor

pminervini commented Oct 16, 2019

@stevezheng23 if you look at it, max_grad_norm clipping is applied to the gradients at every sub-step of the accumulation - I think it should be done just once, right before the optimizer.step() call.

https://github.com/huggingface/transformers/blob/master/examples/run_squad.py#L163

@thomwolf what do you think? Should I go ahead and open a PR?
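The concern above is about where clipping happens relative to accumulation: clipping once over the summed gradient bounds the update optimizer.step() actually sees, whereas clipping every sub-step bounds each contribution separately. A minimal numeric sketch (plain lists standing in for parameter gradients, not the actual PyTorch training loop):

```python
def clip_norm(grads, max_norm):
    """Scale the gradient vector down so its L2 norm is at most max_norm."""
    total = sum(g * g for g in grads) ** 0.5
    if total > max_norm:
        scale = max_norm / total
        return [g * scale for g in grads]
    return list(grads)

def accumulate_then_clip(microbatch_grads, max_norm):
    """Sum gradients over all micro-batches, then clip ONCE before the
    (conceptual) optimizer.step() - the placement argued for above."""
    acc = [0.0] * len(microbatch_grads[0])
    for g in microbatch_grads:
        acc = [a + x for a, x in zip(acc, g)]
    return clip_norm(acc, max_norm)

# Clipping once: [3,0] + [0,4] = [3,4], norm 5, clipped to [0.6, 0.8].
# Clipping each micro-batch first would instead give [1,0] + [0,1] = [1,1],
# a different effective update direction with norm sqrt(2) > max_norm.
print(accumulate_then_clip([[3.0, 0.0], [0.0, 4.0]], max_norm=1.0))
```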

c0derm4n added a commit to c0derm4n/transformers that referenced this pull request Oct 31, 2019
huggingface#1386

from transformers import (WEIGHTS_NAME, BertConfig, BertForQuestionAnswering, BertTokenizer,
                          XLNetForQuestionAnswering,
                          XLNetTokenizer,
                          DistilBertConfig, DistilBertForQuestionAnswering, DistilBertTokenizer)
Contributor


Can you add RoBERTa to the title - (Finetuning the library models for question-answering...)


#### Fine-tuning RoBERTa on SQuAD

This is an example of using 4-GPU distributed training to fine-tune a RoBERTa-large model on the SQuAD v2.0 dataset:
Contributor


Can you add the GPU model, so it will be easy to tune per_gpu_eval_batch_size relative to the memory size of the GPU used in the example?

@julien-c
Member

@LysandreJik just significantly rewrote our SQuAD integration in #1984 so we were holding out on merging this.

Does anyone here want to revisit this PR with the changes from #1984? Otherwise, we'll do it, time permitting.

@erenup
Contributor

erenup commented Dec 11, 2019

Cool, I'm willing to revisit it. I will take a look at your changes and transformers' recent updates today (I have been away from the master branch for some time 😊).

@ethanjperez
Contributor

ethanjperez commented Dec 11, 2019

@julien-c I've also uploaded the RoBERTa-large model fine-tuned on SQuAD v2.0 data, together with its prediction & evaluation results, to public cloud storage: https://storage.googleapis.com/mrc_data/squad/roberta.large.squad.v2.zip

Hey @stevezheng23 !

I just tried to reproduce your model with slightly different hyperparameters (batch_size=2 and gradient_accumulation=6 instead of batch_size=12), and I am currently getting worse results.

Do you have any ideas why this is happening?

You're using num_train_epochs=8 instead of 2, which makes the learning rate decay more slowly. Maybe that is causing the difference?
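The effect can be seen in the warmup-then-linear-decay schedule the example scripts use: quadrupling the planned epochs quadruples total_steps, so at the same absolute step the learning rate is several times higher. A sketch of that schedule (the shape is assumed to match the scripts' WarmupLinearSchedule; step counts are illustrative):

```python
def linear_decay_lr(base_lr, step, warmup_steps, total_steps):
    """Linear warmup to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Same absolute step, two schedules: 2 epochs (~10k steps) vs 8 epochs (~40k).
lr_2_epochs = linear_decay_lr(1.5e-5, step=8000, warmup_steps=500, total_steps=10000)
lr_8_epochs = linear_decay_lr(1.5e-5, step=8000, warmup_steps=500, total_steps=40000)
print(lr_2_epochs, lr_8_epochs)
```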

@ethanjperez
Contributor

ethanjperez commented Dec 11, 2019

Regarding max_grad_norm - RoBERTa doesn't use gradient clipping, so the max_grad_norm changes aren't strictly necessary here

RoBERTa also uses adam_epsilon=1e-06 as I understand, but I'm not sure if it would change the results here

@erenup erenup mentioned this pull request Dec 14, 2019
@erenup
Contributor

erenup commented Dec 14, 2019

Hi @stevezheng23 @julien-c @thomwolf @ethanjperez , I updated the run squad with roberta in #2173
based on #1984 and #1386. Could you please help to review it? Thank you very much.

@julien-c
Member

Closed in favor of #2173 which should be merged soon.

@julien-c julien-c closed this Dec 20, 2019
