diff --git a/demo/tutorials/llm_notebooks/dataset-notebooks/ASDiv_dataset.ipynb b/demo/tutorials/llm_notebooks/dataset-notebooks/ASDiv_dataset.ipynb
new file mode 100644
index 000000000..7d28500d4
--- /dev/null
+++ b/demo/tutorials/llm_notebooks/dataset-notebooks/ASDiv_dataset.ipynb
@@ -0,0 +1 @@
+{"cells":[{"cell_type":"markdown","metadata":{"id":"-euMnuisAIDX"},"source":[""]},{"cell_type":"markdown","metadata":{"id":"Gqj3MUP46ZXF"},"source":["[Open In Colab](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/llm_notebooks/dataset-notebooks/ASDiv_dataset.ipynb)"]},{"cell_type":"markdown","metadata":{"id":"wCxsD2KDAWU2"},"source":["**LangTest** is an open-source Python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, spaCy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has you covered. You can test any Named Entity Recognition (NER) or Text Classification model using the library. We also support testing LLMs for Question-Answering and Summarization tasks on benchmark datasets. The library supports 50+ out-of-the-box tests. These tests fall into robustness, accuracy, bias, representation, toxicity and fairness test categories.\n","\n","Metrics are calculated by comparing the model's extractions in the original list of sentences against the extractions carried out in the noisy list of sentences. The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings."]},{"cell_type":"markdown","metadata":{"id":"jNG1OYuQAgtW"},"source":["# Getting started with LangTest"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"19BPyR196ZXS"},"outputs":[],"source":["!pip install \"langtest[langchain,openai,transformers,evaluate]\""]},{"cell_type":"markdown","metadata":{"id":"EsEtlSiNAnSO"},"source":["# Harness and Its Parameters\n","\n","The Harness class is a testing class for Natural Language Processing (NLP) models. 
It evaluates the performance of an NLP model on a given task using test data and generates a report with test results. Harness can be imported from the LangTest library in the following way."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"w2GPpdowS1C9"},"outputs":[],"source":["# Import Harness from the LangTest library\n","from langtest import Harness"]},{"cell_type":"markdown","metadata":{"id":"7_6PF_HGA4EO"},"source":["This imports the Harness class from the langtest module. The class provides a blueprint for conducting NLP testing, and its instances can be customized or configured for different testing scenarios or environments.\n","\n","Here is a list of the parameters that can be passed to the Harness constructor:\n","\n","
\n","\n","\n","| Parameter | Description | \n","| - | - |\n","|**task** |Task for which the model is to be evaluated (question-answering or summarization)|\n","| **model** | Specifies the model(s) to be evaluated. Can be a dictionary or a list of dictionaries. Each dictionary should contain 'model' and 'hub' keys. If a path is specified, the dictionary must contain 'model' and 'hub' keys.|\n","| **data** | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys:

|  | category | test_type | original_context | original_question | perturbed_context | perturbed_question |
|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | Seven red apples and two green apples are in t... | How many apples are in the basket? | SEVEN RED APPLES AND TWO GREEN APPLES ARE IN T... | HOW MANY APPLES ARE IN THE BASKET? |
| 1 | robustness | uppercase | Ellen has six more balls than Marin. Marin has... | How many balls does Ellen have? | ELLEN HAS SIX MORE BALLS THAN MARIN. MARIN HAS... | HOW MANY BALLS DOES ELLEN HAVE? |
| 2 | robustness | uppercase | Janet has nine oranges and Sharon has seven or... | How many oranges do Janet and Sharon have toge... | JANET HAS NINE ORANGES AND SHARON HAS SEVEN OR... | HOW MANY ORANGES DO JANET AND SHARON HAVE TOGE... |
| 3 | robustness | uppercase | Allan brought two balloons and Jake brought fo... | How many balloons did Allan and Jake have in t... | ALLAN BROUGHT TWO BALLOONS AND JAKE BROUGHT FO... | HOW MANY BALLOONS DID ALLAN AND JAKE HAVE IN T... |
| 4 | robustness | uppercase | Adam has five more apples than Jackie. Jackie ... | How many apples does Adam have? | ADAM HAS FIVE MORE APPLES THAN JACKIE. JACKIE ... | HOW MANY APPLES DOES ADAM HAVE? |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | robustness | lowercase | Mrs. Hilt spent 25 cents on one caramel apple ... | How much more did the apple cost? | mrs. hilt spent 25 cents on one caramel apple ... | how much more did the apple cost? |
| 96 | robustness | lowercase | Mrs. Hilt bought 2 pizzas. Each pizza had 8 sl... | How many total slices of pizza did she have? | mrs. hilt bought 2 pizzas. each pizza had 8 sl... | how many total slices of pizza did she have? |
| 97 | robustness | lowercase | Mrs. Hilt read 2 books per day. | How many books did she read in one week? | mrs. hilt read 2 books per day. | how many books did she read in one week? |
| 98 | robustness | lowercase | Mrs. Hilt ate 5 apples every hour. | How many apples had she eaten at the end of 3 ... | mrs. hilt ate 5 apples every hour. | how many apples had she eaten at the end of 3 ... |
| 99 | robustness | lowercase | Mrs. Hilt gave 2 pieces of candy to each stude... | How many pieces of candy did Mrs. Hilt give away? | mrs. hilt gave 2 pieces of candy to each stude... | how many pieces of candy did mrs. hilt give away? |

100 rows × 6 columns
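
The `perturbed_*` columns in the test cases above are consistent with plain case conversion of the original text. A minimal sketch of that idea follows; the `perturb` helper is hypothetical and written only for illustration, and it is not LangTest's own implementation.

```python
def perturb(text: str, test_type: str) -> str:
    """Apply the case perturbation named in the test_type column.

    Hypothetical helper: only mirrors the two test types shown in the table.
    """
    if test_type == "uppercase":
        return text.upper()
    if test_type == "lowercase":
        return text.lower()
    raise ValueError(f"unsupported test_type: {test_type}")


# Row 0 of the table (uppercase) and row 97 (lowercase).
print(perturb("How many apples are in the basket?", "uppercase"))
# HOW MANY APPLES ARE IN THE BASKET?
print(perturb("Mrs. Hilt read 2 books per day.", "lowercase"))
# mrs. hilt read 2 books per day.
```

Both the context and the question are perturbed in every row, so a caller would apply the helper to each field separately.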

|  | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | Seven red apples and two green apples are in t... | How many apples are in the basket? | SEVEN RED APPLES AND TWO GREEN APPLES ARE IN T... | HOW MANY APPLES ARE IN THE BASKET? | Nine apples are in the basket. | Nine apples are in the basket. | True |
| 1 | robustness | uppercase | Ellen has six more balls than Marin. Marin has... | How many balls does Ellen have? | ELLEN HAS SIX MORE BALLS THAN MARIN. MARIN HAS... | HOW MANY BALLS DOES ELLEN HAVE? | Ellen has fifteen balls. | Ellen has fifteen balls. | True |
| 2 | robustness | uppercase | Janet has nine oranges and Sharon has seven or... | How many oranges do Janet and Sharon have toge... | JANET HAS NINE ORANGES AND SHARON HAS SEVEN OR... | HOW MANY ORANGES DO JANET AND SHARON HAVE TOGE... | Janet and Sharon have a total of sixteen oran... | Janet and Sharon have a total of sixteen oran... | True |
| 3 | robustness | uppercase | Allan brought two balloons and Jake brought fo... | How many balloons did Allan and Jake have in t... | ALLAN BROUGHT TWO BALLOONS AND JAKE BROUGHT FO... | HOW MANY BALLOONS DID ALLAN AND JAKE HAVE IN T... | Allan and Jake had six balloons in the park. | Allan and Jake had six balloons in the park. | True |
| 4 | robustness | uppercase | Adam has five more apples than Jackie. Jackie ... | How many apples does Adam have? | ADAM HAS FIVE MORE APPLES THAN JACKIE. JACKIE ... | HOW MANY APPLES DOES ADAM HAVE? | Adam has 14 apples. | Adam has 14 apples. | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | robustness | lowercase | Mrs. Hilt spent 25 cents on one caramel apple ... | How much more did the apple cost? | mrs. hilt spent 25 cents on one caramel apple ... | how much more did the apple cost? | The apple cost 10 cents more than the ice cre... | The apple cost 10 cents more than the ice cre... | True |
| 96 | robustness | lowercase | Mrs. Hilt bought 2 pizzas. Each pizza had 8 sl... | How many total slices of pizza did she have? | mrs. hilt bought 2 pizzas. each pizza had 8 sl... | how many total slices of pizza did she have? | Mrs. Hilt had 16 total slices of pizza. | Mrs. Hilt had 16 total slices of pizza. | True |
| 97 | robustness | lowercase | Mrs. Hilt read 2 books per day. | How many books did she read in one week? | mrs. hilt read 2 books per day. | how many books did she read in one week? | Mrs. Hilt read 14 books in one week. | Mrs. Hilt read 14 books in one week. | True |
| 98 | robustness | lowercase | Mrs. Hilt ate 5 apples every hour. | How many apples had she eaten at the end of 3 ... | mrs. hilt ate 5 apples every hour. | how many apples had she eaten at the end of 3 ... | Mrs. Hilt had eaten 15 apples at the end of 3... | Mrs. Hilt had eaten 15 apples at the end of 3... | True |
| 99 | robustness | lowercase | Mrs. Hilt gave 2 pieces of candy to each stude... | How many pieces of candy did Mrs. Hilt give away? | mrs. hilt gave 2 pieces of candy to each stude... | how many pieces of candy did mrs. hilt give away? | Mrs. Hilt gave away 18 pieces of candy. | Mrs. Hilt gave away 18 pieces of candy. | True |

100 rows × 9 columns

|  | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
|---|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | 1 | 49 | 98% | 66% | True |
| 1 | robustness | lowercase | 1 | 49 | 98% | 60% | True |
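
The report columns above can be reproduced with simple arithmetic: `pass_rate` is `pass_count` divided by the total number of cases, and the final `pass` flag compares it with `minimum_pass_rate`. The sketch below assumes the comparison is `>=`, which agrees with the rows shown but is not confirmed by the source; the `summarize` helper is hypothetical.

```python
def summarize(pass_count: int, fail_count: int, minimum_pass_rate: float) -> dict:
    """Recompute one report row from its raw counts (illustrative only)."""
    total = pass_count + fail_count
    pass_rate = pass_count / total if total else 0.0
    return {
        "pass_rate": f"{pass_rate:.0%}",
        "minimum_pass_rate": f"{minimum_pass_rate:.0%}",
        # Assumption: a test type passes when its rate meets or exceeds the minimum.
        "pass": pass_rate >= minimum_pass_rate,
    }


# The uppercase row: 49 passes, 1 failure, minimum pass rate of 66%.
print(summarize(pass_count=49, fail_count=1, minimum_pass_rate=0.66))
# {'pass_rate': '98%', 'minimum_pass_rate': '66%', 'pass': True}
```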

|  | category | test_type | test_case |
|---|---|---|---|
| 0 | fairness | min_gender_rouge1_score | male |
| 1 | fairness | min_gender_rouge1_score | female |
| 2 | fairness | min_gender_rouge1_score | unknown |
| 3 | fairness | min_gender_rouge2_score | male |
| 4 | fairness | min_gender_rouge2_score | female |
| 5 | fairness | min_gender_rouge2_score | unknown |
| 6 | fairness | min_gender_rougeL_score | male |
| 7 | fairness | min_gender_rougeL_score | female |
| 8 | fairness | min_gender_rougeL_score | unknown |
| 9 | fairness | min_gender_rougeLsum_score | male |
| 10 | fairness | min_gender_rougeLsum_score | female |
| 11 | fairness | min_gender_rougeLsum_score | unknown |
| 12 | fairness | max_gender_rouge1_score | male |
| 13 | fairness | max_gender_rouge1_score | female |
| 14 | fairness | max_gender_rouge1_score | unknown |
| 15 | fairness | max_gender_rouge2_score | male |
| 16 | fairness | max_gender_rouge2_score | female |
| 17 | fairness | max_gender_rouge2_score | unknown |
| 18 | fairness | max_gender_rougeL_score | male |
| 19 | fairness | max_gender_rougeL_score | female |
| 20 | fairness | max_gender_rougeL_score | unknown |
| 21 | fairness | max_gender_rougeLsum_score | male |
| 22 | fairness | max_gender_rougeLsum_score | female |
| 23 | fairness | max_gender_rougeLsum_score | unknown |

|  | category | test_type | test_case | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|
| 0 | fairness | min_gender_rouge1_score | male | 0.66 | 0.428889 | False |
| 1 | fairness | min_gender_rouge1_score | female | 0.66 | 0.360332 | False |
| 2 | fairness | min_gender_rouge1_score | unknown | 0.66 | 0.200000 | False |
| 3 | fairness | min_gender_rouge2_score | male | 0.60 | 0.228571 | False |
| 4 | fairness | min_gender_rouge2_score | female | 0.60 | 0.179523 | False |
| 5 | fairness | min_gender_rouge2_score | unknown | 0.60 | 0.000000 | False |
| 6 | fairness | min_gender_rougeL_score | male | 0.66 | 0.425000 | False |
| 7 | fairness | min_gender_rougeL_score | female | 0.66 | 0.359968 | False |
| 8 | fairness | min_gender_rougeL_score | unknown | 0.66 | 0.200000 | False |
| 9 | fairness | min_gender_rougeLsum_score | male | 0.66 | 0.427639 | False |
| 10 | fairness | min_gender_rougeLsum_score | female | 0.66 | 0.358361 | False |
| 11 | fairness | min_gender_rougeLsum_score | unknown | 0.66 | 0.200000 | False |
| 12 | fairness | max_gender_rouge1_score | male | 0.66 | 0.428889 | True |
| 13 | fairness | max_gender_rouge1_score | female | 0.66 | 0.360332 | True |
| 14 | fairness | max_gender_rouge1_score | unknown | 0.66 | 0.200000 | True |
| 15 | fairness | max_gender_rouge2_score | male | 0.60 | 0.228571 | True |
| 16 | fairness | max_gender_rouge2_score | female | 0.60 | 0.179523 | True |
| 17 | fairness | max_gender_rouge2_score | unknown | 0.60 | 0.000000 | True |
| 18 | fairness | max_gender_rougeL_score | male | 0.66 | 0.425000 | True |
| 19 | fairness | max_gender_rougeL_score | female | 0.66 | 0.359968 | True |
| 20 | fairness | max_gender_rougeL_score | unknown | 0.66 | 0.200000 | True |
| 21 | fairness | max_gender_rougeLsum_score | male | 0.66 | 0.427639 | True |
| 22 | fairness | max_gender_rougeLsum_score | female | 0.66 | 0.358361 | True |
| 23 | fairness | max_gender_rougeLsum_score | unknown | 0.66 | 0.200000 | True |
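
The same scores appear twice in the fairness results, once under a `min_` test and once under a `max_` test, with opposite outcomes. The rows are consistent with `min_` tests passing when the score reaches the threshold and `max_` tests passing when it stays at or below it. The sketch below encodes that reading; the `passes` helper is hypothetical, and the behaviour at exact equality is an assumption.

```python
def passes(test_type: str, expected: float, actual: float) -> bool:
    """Decide pass/fail for a threshold test based on its name prefix."""
    if test_type.startswith("min_"):
        # Assumption: the score must be at least the threshold.
        return actual >= expected
    if test_type.startswith("max_"):
        # Assumption: the score must not exceed the threshold.
        return actual <= expected
    raise ValueError(f"unknown threshold direction: {test_type}")


# Rows 0 and 12 of the table: same score, opposite outcome.
print(passes("min_gender_rouge1_score", 0.66, 0.428889))  # False
print(passes("max_gender_rouge1_score", 0.66, 0.428889))  # True
```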

|  | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
|---|---|---|---|---|---|---|---|
| 0 | fairness | min_gender_rouge1_score | 3 | 0 | 0% | 65% | False |
| 1 | fairness | min_gender_rouge2_score | 3 | 0 | 0% | 65% | False |
| 2 | fairness | min_gender_rougeL_score | 3 | 0 | 0% | 65% | False |
| 3 | fairness | min_gender_rougeLsum_score | 3 | 0 | 0% | 65% | False |
| 4 | fairness | max_gender_rouge1_score | 0 | 3 | 100% | 65% | True |
| 5 | fairness | max_gender_rouge2_score | 0 | 3 | 100% | 65% | True |
| 6 | fairness | max_gender_rougeL_score | 0 | 3 | 100% | 65% | True |
| 7 | fairness | max_gender_rougeLsum_score | 0 | 3 | 100% | 65% | True |

|  | category | test_type |
|---|---|---|
| 0 | accuracy | min_exact_match_score |
| 1 | accuracy | min_rouge1_score |
| 2 | accuracy | min_rougeL_score |
| 3 | accuracy | min_bleu_score |
| 4 | accuracy | min_rouge2_score |
| 5 | accuracy | min_rougeLsum_score |

|  | category | test_type | expected_result | actual_result | pass |
|---|---|---|---|---|---|
| 0 | accuracy | min_exact_match_score | 0.8 | 0.000000 | False |
| 1 | accuracy | min_rouge1_score | 0.8 | 0.372327 | False |
| 2 | accuracy | min_rougeL_score | 0.8 | 0.368632 | False |
| 3 | accuracy | min_bleu_score | 0.8 | 0.000000 | False |
| 4 | accuracy | min_rouge2_score | 0.8 | 0.188883 | False |
| 5 | accuracy | min_rougeLsum_score | 0.8 | 0.371052 | False |
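
To make the gap between the exact-match and ROUGE scores above more concrete, here is a simplified sketch of both metrics. It uses whitespace tokenization and lowercasing, which is an assumption for illustration and not the implementation used by the `evaluate` package; the reference answer `"nine apples"` is hypothetical and not taken from the dataset.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 only when the normalized strings are identical."""
    return float(prediction.strip().lower() == reference.strip().lower())


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 using naive whitespace tokens (simplified ROUGE-1)."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


prediction = "Nine apples are in the basket."
reference = "nine apples"  # hypothetical short reference answer

print(exact_match(prediction, reference))  # 0.0
print(rouge1_f1(prediction, reference))
```

Under these assumptions a full-sentence answer can share every reference token and still score zero on exact match, while the overlap-based score lands somewhere between 0 and 1.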

|  | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
|---|---|---|---|---|---|---|---|
| 0 | accuracy | min_exact_match_score | 1 | 0 | 0% | 65% | False |
| 1 | accuracy | min_rouge1_score | 1 | 0 | 0% | 65% | False |
| 2 | accuracy | min_rougeL_score | 1 | 0 | 0% | 65% | False |
| 3 | accuracy | min_bleu_score | 1 | 0 | 0% | 65% | False |
| 4 | accuracy | min_rouge2_score | 1 | 0 | 0% | 65% | False |
| 5 | accuracy | min_rougeLsum_score | 1 | 0 | 0% | 65% | False |

|  | category | test_type | original_context | original_question | perturbed_context | perturbed_question |
|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | The children had been sitting outside of the k... | This narrative is a good illustration of the f... | THE CHILDREN HAD BEEN SITTING OUTSIDE OF THE K... | THIS NARRATIVE IS A GOOD ILLUSTRATION OF THE F... |
| 1 | robustness | uppercase | He dresses in a gothic style: all black clothi... | This narrative is a good illustration of the f... | HE DRESSES IN A GOTHIC STYLE: ALL BLACK CLOTHI... | THIS NARRATIVE IS A GOOD ILLUSTRATION OF THE F... |
| 2 | robustness | uppercase | She always wanted to go on a vacation to a pla... | This narrative is a good illustration of the f... | SHE ALWAYS WANTED TO GO ON A VACATION TO A PLA... | THIS NARRATIVE IS A GOOD ILLUSTRATION OF THE F... |
| 3 | robustness | uppercase | The man who owned the little corner diner for ... | This narrative is a good illustration of the f... | THE MAN WHO OWNED THE LITTLE CORNER DINER FOR ... | THIS NARRATIVE IS A GOOD ILLUSTRATION OF THE F... |
| 4 | robustness | uppercase | Dwayne was a singer. He went to a bar to part... | This narrative is a good illustration of the f... | DWAYNE WAS A SINGER. HE WENT TO A BAR TO PARTY... | THIS NARRATIVE IS A GOOD ILLUSTRATION OF THE F... |
| ... | ... | ... | ... | ... | ... | ... |
| 125 | accuracy | min_rouge1_score | - | - | - | - |
| 126 | accuracy | min_rougeL_score | - | - | - | - |
| 127 | accuracy | min_bleu_score | - | - | - | - |
| 128 | accuracy | min_rouge2_score | - | - | - | - |
| 129 | accuracy | min_rougeLsum_score | - | - | - | - |

130 rows × 6 columns

|  | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | The children had been sitting outside of the k... | This narrative is a good illustration of the f... | THE CHILDREN HAD BEEN SITTING OUTSIDE OF THE K... | THIS NARRATIVE IS A GOOD ILLUSTRATION OF THE F... | 1. Good things come to those that wait | 1. GOOD THINGS COME TO THOSE THAT WAIT | True |
| 1 | robustness | uppercase | He dresses in a gothic style: all black clothi... | This narrative is a good illustration of the f... | HE DRESSES IN A GOTHIC STYLE: ALL BLACK CLOTHI... | THIS NARRATIVE IS A GOOD ILLUSTRATION OF THE F... | Never judge a book by its cover | 1. Never judge a book by its cover | True |
| 2 | robustness | uppercase | She always wanted to go on a vacation to a pla... | This narrative is a good illustration of the f... | SHE ALWAYS WANTED TO GO ON A VACATION TO A PLA... | THIS NARRATIVE IS A GOOD ILLUSTRATION OF THE F... | 4. That which does not kill us makes us stronger | 4. THAT WHICH DOES NOT KILL US MAKES US STRONGER | True |
| 3 | robustness | uppercase | The man who owned the little corner diner for ... | This narrative is a good illustration of the f... | THE MAN WHO OWNED THE LITTLE CORNER DINER FOR ... | THIS NARRATIVE IS A GOOD ILLUSTRATION OF THE F... | 4. Never judge a book by its cover | 3. THERE'S NO ACCOUNTING FOR TASTES | False |
| 4 | robustness | uppercase | Dwayne was a singer. He went to a bar to part... | This narrative is a good illustration of the f... | DWAYNE WAS A SINGER. HE WENT TO A BAR TO PARTY... | THIS NARRATIVE IS A GOOD ILLUSTRATION OF THE F... | 1. All publicity is good publicity | 1. All Publicity is Good Publicity | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 125 | accuracy | min_rouge1_score | - | - | - | - | 0.8 | 0.613867 | False |
| 126 | accuracy | min_rougeL_score | - | - | - | - | 0.8 | 0.604897 | False |
| 127 | accuracy | min_bleu_score | - | - | - | - | 0.8 | 0.412708 | False |
| 128 | accuracy | min_rouge2_score | - | - | - | - | 0.8 | 0.587841 | False |
| 129 | accuracy | min_rougeLsum_score | - | - | - | - | 0.8 | 0.610359 | False |

130 rows × 9 columns

|  | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | The children had been sitting outside of the k... | This narrative is a good illustration of the f... | THE CHILDREN HAD BEEN SITTING OUTSIDE OF THE K... | THIS NARRATIVE IS A GOOD ILLUSTRATION OF THE F... | 1. Good things come to those that wait | 1. GOOD THINGS COME TO THOSE THAT WAIT | True |
| 1 | robustness | uppercase | He dresses in a gothic style: all black clothi... | This narrative is a good illustration of the f... | HE DRESSES IN A GOTHIC STYLE: ALL BLACK CLOTHI... | THIS NARRATIVE IS A GOOD ILLUSTRATION OF THE F... | Never judge a book by its cover | 1. Never judge a book by its cover | True |
| 2 | robustness | uppercase | She always wanted to go on a vacation to a pla... | This narrative is a good illustration of the f... | SHE ALWAYS WANTED TO GO ON A VACATION TO A PLA... | THIS NARRATIVE IS A GOOD ILLUSTRATION OF THE F... | 4. That which does not kill us makes us stronger | 4. THAT WHICH DOES NOT KILL US MAKES US STRONGER | True |
| 3 | robustness | uppercase | The man who owned the little corner diner for ... | This narrative is a good illustration of the f... | THE MAN WHO OWNED THE LITTLE CORNER DINER FOR ... | THIS NARRATIVE IS A GOOD ILLUSTRATION OF THE F... | 4. Never judge a book by its cover | 3. THERE'S NO ACCOUNTING FOR TASTES | False |
| 4 | robustness | uppercase | Dwayne was a singer. He went to a bar to part... | This narrative is a good illustration of the f... | DWAYNE WAS A SINGER. HE WENT TO A BAR TO PARTY... | THIS NARRATIVE IS A GOOD ILLUSTRATION OF THE F... | 1. All publicity is good publicity | 1. All Publicity is Good Publicity | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | robustness | lowercase | There was a shopkeeper that noticed his stock ... | This narrative is a good illustration of the f... | there was a shopkeeper that noticed his stock ... | this narrative is a good illustration of the f... | 5. It takes a thief to catch a thief | 5. It takes a thief to catch a thief | True |
| 96 | robustness | lowercase | After gazing at the store front for about ten ... | This narrative is a good illustration of the f... | after gazing at the store front for about ten ... | this narrative is a good illustration of the f... | 1. Cut your coat to suit your cloth | 1. cut your coat to suit your cloth | True |
| 97 | robustness | lowercase | Their business had been one of the most succes... | This narrative is a good illustration of the f... | their business had been one of the most succes... | this narrative is a good illustration of the f... | A house divided against itself cannot stand. | A. A house divided against itself cannot stand | True |
| 98 | robustness | lowercase | A couple went on a trip to a distant country. ... | This narrative is a good illustration of the f... | a couple went on a trip to a distant country. ... | this narrative is a good illustration of the f... | 1. Bad news travels fast | 1. Bad News Travels Fast | True |
| 99 | robustness | lowercase | He wanted to work on some crowdsourced micro w... | This narrative is a good illustration of the f... | he wanted to work on some crowdsourced micro w... | this narrative is a good illustration of the f... | 1. You are never too old to learn | You are never too old to learn. | True |

100 rows × 9 columns

|  | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 100 | fairness | min_gender_rouge1_score | - | male | - | - | 0.66 | 0.59319 | False |
| 101 | fairness | min_gender_rouge1_score | - | female | - | - | 0.66 | 0.661657 | True |
| 102 | fairness | min_gender_rouge1_score | - | unknown | - | - | 0.66 | 1.0 | True |
| 103 | fairness | min_gender_rouge2_score | - | male | - | - | 0.6 | 0.559098 | False |
| 104 | fairness | min_gender_rouge2_score | - | female | - | - | 0.6 | 0.637991 | True |
| 105 | fairness | min_gender_rouge2_score | - | unknown | - | - | 0.6 | 1.0 | True |
| 106 | fairness | min_gender_rougeL_score | - | male | - | - | 0.66 | 0.589718 | False |
| 107 | fairness | min_gender_rougeL_score | - | female | - | - | 0.66 | 0.65639 | False |
| 108 | fairness | min_gender_rougeL_score | - | unknown | - | - | 0.66 | 1.0 | True |
| 109 | fairness | min_gender_rougeLsum_score | - | male | - | - | 0.66 | 0.585628 | False |
| 110 | fairness | min_gender_rougeLsum_score | - | female | - | - | 0.66 | 0.659493 | False |
| 111 | fairness | min_gender_rougeLsum_score | - | unknown | - | - | 0.66 | 1.0 | True |
| 112 | fairness | max_gender_rouge1_score | - | male | - | - | 0.66 | 0.59319 | True |
| 113 | fairness | max_gender_rouge1_score | - | female | - | - | 0.66 | 0.661657 | False |
| 114 | fairness | max_gender_rouge1_score | - | unknown | - | - | 0.66 | 1.0 | False |
| 115 | fairness | max_gender_rouge2_score | - | male | - | - | 0.6 | 0.559098 | True |
| 116 | fairness | max_gender_rouge2_score | - | female | - | - | 0.6 | 0.637991 | False |
| 117 | fairness | max_gender_rouge2_score | - | unknown | - | - | 0.6 | 1.0 | False |
| 118 | fairness | max_gender_rougeL_score | - | male | - | - | 0.66 | 0.589718 | True |
| 119 | fairness | max_gender_rougeL_score | - | female | - | - | 0.66 | 0.65639 | True |
| 120 | fairness | max_gender_rougeL_score | - | unknown | - | - | 0.66 | 1.0 | False |
| 121 | fairness | max_gender_rougeLsum_score | - | male | - | - | 0.66 | 0.585628 | True |
| 122 | fairness | max_gender_rougeLsum_score | - | female | - | - | 0.66 | 0.659493 | True |
| 123 | fairness | max_gender_rougeLsum_score | - | unknown | - | - | 0.66 | 1.0 | False |
|  | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 124 | accuracy | min_exact_match_score | - | - | - | - | 0.8 | 0.02 | False |
| 125 | accuracy | min_rouge1_score | - | - | - | - | 0.8 | 0.613867 | False |
| 126 | accuracy | min_rougeL_score | - | - | - | - | 0.8 | 0.604897 | False |
| 127 | accuracy | min_bleu_score | - | - | - | - | 0.8 | 0.412708 | False |
| 128 | accuracy | min_rouge2_score | - | - | - | - | 0.8 | 0.587841 | False |
| 129 | accuracy | min_rougeLsum_score | - | - | - | - | 0.8 | 0.610359 | False |
|  | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
|---|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | 5 | 45 | 90% | 66% | True |
| 1 | robustness | lowercase | 3 | 47 | 94% | 60% | True |
| 2 | fairness | min_gender_rouge1_score | 1 | 2 | 67% | 65% | True |
| 3 | fairness | min_gender_rouge2_score | 1 | 2 | 67% | 65% | True |
| 4 | fairness | min_gender_rougeL_score | 2 | 1 | 33% | 65% | False |
| 5 | fairness | min_gender_rougeLsum_score | 2 | 1 | 33% | 65% | False |
| 6 | fairness | max_gender_rouge1_score | 2 | 1 | 33% | 65% | False |
| 7 | fairness | max_gender_rouge2_score | 2 | 1 | 33% | 65% | False |
| 8 | fairness | max_gender_rougeL_score | 1 | 2 | 67% | 65% | True |
| 9 | fairness | max_gender_rougeLsum_score | 1 | 2 | 67% | 65% | True |
| 10 | accuracy | min_exact_match_score | 1 | 0 | 0% | 65% | False |
| 11 | accuracy | min_rouge1_score | 1 | 0 | 0% | 65% | False |
| 12 | accuracy | min_rougeL_score | 1 | 0 | 0% | 65% | False |
| 13 | accuracy | min_bleu_score | 1 | 0 | 0% | 65% | False |
| 14 | accuracy | min_rouge2_score | 1 | 0 | 0% | 65% | False |
| 15 | accuracy | min_rougeLsum_score | 1 | 0 | 0% | 65% | False |
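
The `pass_rate` and `pass` columns of the summary above look consistent with a simple ratio check on the counts. The snippet below is a minimal sketch of that arithmetic, assuming `pass_rate` is `pass_count / (pass_count + fail_count)` rounded to a whole percent and that a test type passes when the rate meets `minimum_pass_rate`. The helper name `summarize` is hypothetical and is not part of the LangTest API.

```python
def summarize(pass_count: int, fail_count: int, minimum_pass_rate: float) -> dict:
    """Recompute one summary row from its counts (illustrative helper only)."""
    total = pass_count + fail_count
    # Guard against an empty test type so the division cannot fail.
    pass_rate = pass_count / total if total else 0.0
    return {
        "pass_rate": f"{round(pass_rate * 100)}%",   # displayed as a whole percent
        "pass": pass_rate >= minimum_pass_rate,      # threshold given as a fraction
    }


# Counts taken from the robustness/uppercase and fairness rows of the table.
print(summarize(pass_count=45, fail_count=5, minimum_pass_rate=0.66))
print(summarize(pass_count=1, fail_count=2, minimum_pass_rate=0.65))
```

Under these assumptions the first call reproduces the 90% / True row and the second reproduces a 33% / False fairness row.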
|  | category | test_type | original_context | original_question | perturbed_context | perturbed_question |
|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | The CEO of a company is sitting in his office ... | Did the CEO intentionally harm the environment? | THE CEO OF A COMPANY IS SITTING IN HIS OFFICE ... | DID THE CEO INTENTIONALLY HARM THE ENVIRONMENT? |
| 1 | robustness | uppercase | The CEO of a company is sitting in his office ... | Did the CEO intentionally help the environment? | THE CEO OF A COMPANY IS SITTING IN HIS OFFICE ... | DID THE CEO INTENTIONALLY HELP THE ENVIRONMENT? |
| 2 | robustness | uppercase | Lauren and Jane work for the same company. The... | Did Jane cause the computer to crash? | LAUREN AND JANE WORK FOR THE SAME COMPANY. THE... | DID JANE CAUSE THE COMPUTER TO CRASH? |
| 3 | robustness | uppercase | Lauren and Jane work for the same company. The... | Did Jane cause the computer to crash? | LAUREN AND JANE WORK FOR THE SAME COMPANY. THE... | DID JANE CAUSE THE COMPUTER TO CRASH? |
| 4 | robustness | uppercase | The receptionist in the philosophy department ... | Did Professor Smith cause the problem? | THE RECEPTIONIST IN THE PHILOSOPHY DEPARTMENT ... | DID PROFESSOR SMITH CAUSE THE PROBLEM? |
| ... | ... | ... | ... | ... | ... | ... |
| 125 | accuracy | min_rouge1_score | - | - | - | - |
| 126 | accuracy | min_rougeL_score | - | - | - | - |
| 127 | accuracy | min_bleu_score | - | - | - | - |
| 128 | accuracy | min_rouge2_score | - | - | - | - |
| 129 | accuracy | min_rougeLsum_score | - | - | - | - |
130 rows × 6 columns

|  | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | The CEO of a company is sitting in his office ... | Did the CEO intentionally harm the environment? | THE CEO OF A COMPANY IS SITTING IN HIS OFFICE ... | DID THE CEO INTENTIONALLY HARM THE ENVIRONMENT? | No | No. | True |
| 1 | robustness | uppercase | The CEO of a company is sitting in his office ... | Did the CEO intentionally help the environment? | THE CEO OF A COMPANY IS SITTING IN HIS OFFICE ... | DID THE CEO INTENTIONALLY HELP THE ENVIRONMENT? | No | No | True |
| 2 | robustness | uppercase | Lauren and Jane work for the same company. The... | Did Jane cause the computer to crash? | LAUREN AND JANE WORK FOR THE SAME COMPANY. THE... | DID JANE CAUSE THE COMPUTER TO CRASH? | Yes | Yes | True |
| 3 | robustness | uppercase | Lauren and Jane work for the same company. The... | Did Jane cause the computer to crash? | LAUREN AND JANE WORK FOR THE SAME COMPANY. THE... | DID JANE CAUSE THE COMPUTER TO CRASH? | No | No | True |
| 4 | robustness | uppercase | The receptionist in the philosophy department ... | Did Professor Smith cause the problem? | THE RECEPTIONIST IN THE PHILOSOPHY DEPARTMENT ... | DID PROFESSOR SMITH CAUSE THE PROBLEM? | Yes | Yes | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 125 | accuracy | min_rouge1_score | - | - | - | - | 0.8 | 0.72 | False |
| 126 | accuracy | min_rougeL_score | - | - | - | - | 0.8 | 0.72 | False |
| 127 | accuracy | min_bleu_score | - | - | - | - | 0.8 | 0.0 | False |
| 128 | accuracy | min_rouge2_score | - | - | - | - | 0.8 | 0.0 | False |
| 129 | accuracy | min_rougeLsum_score | - | - | - | - | 0.8 | 0.72 | False |
130 rows × 9 columns

|  | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | The CEO of a company is sitting in his office ... | Did the CEO intentionally harm the environment? | THE CEO OF A COMPANY IS SITTING IN HIS OFFICE ... | DID THE CEO INTENTIONALLY HARM THE ENVIRONMENT? | No | No. | True |
| 1 | robustness | uppercase | The CEO of a company is sitting in his office ... | Did the CEO intentionally help the environment? | THE CEO OF A COMPANY IS SITTING IN HIS OFFICE ... | DID THE CEO INTENTIONALLY HELP THE ENVIRONMENT? | No | No | True |
| 2 | robustness | uppercase | Lauren and Jane work for the same company. The... | Did Jane cause the computer to crash? | LAUREN AND JANE WORK FOR THE SAME COMPANY. THE... | DID JANE CAUSE THE COMPUTER TO CRASH? | Yes | Yes | True |
| 3 | robustness | uppercase | Lauren and Jane work for the same company. The... | Did Jane cause the computer to crash? | LAUREN AND JANE WORK FOR THE SAME COMPANY. THE... | DID JANE CAUSE THE COMPUTER TO CRASH? | No | No | True |
| 4 | robustness | uppercase | The receptionist in the philosophy department ... | Did Professor Smith cause the problem? | THE RECEPTIONIST IN THE PHILOSOPHY DEPARTMENT ... | DID PROFESSOR SMITH CAUSE THE PROBLEM? | Yes | Yes | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | robustness | lowercase | There is a man who gets paid for pumping water... | Did the man intentionally poison the inhabitants? | there is a man who gets paid for pumping water... | did the man intentionally poison the inhabitants? | No | No | True |
| 96 | robustness | lowercase | Frank T., had an ongoing dispute with his neig... | intentionally shoot his neighbor in the body? | frank t., had an ongoing dispute with his neig... | intentionally shoot his neighbor in the body? | No. | No. | True |
| 97 | robustness | lowercase | Frank T., had an ongoing dispute with his neig... | intentionally shoot his neighbor in the body? | frank t., had an ongoing dispute with his neig... | intentionally shoot his neighbor in the body? | Yes | Yes | True |
| 98 | robustness | lowercase | George and his sister Lena reunite at their pa... | Did George hit the low point region intentiona... | george and his sister lena reunite at their pa... | did george hit the low point region intentiona... | Yes | Yes | True |
| 99 | robustness | lowercase | George and his sister Lena reunite at their pa... | Did George hit the low point region intentiona... | george and his sister lena reunite at their pa... | did george hit the low point region intentiona... | No | No | True |
100 rows × 9 columns

|  | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 100 | fairness | min_gender_rouge1_score | - | male | - | - | 0.66 | 0.666667 | True |
| 101 | fairness | min_gender_rouge1_score | - | female | - | - | 0.66 | 0.875 | True |
| 102 | fairness | min_gender_rouge1_score | - | unknown | - | - | 0.66 | 1.0 | True |
| 103 | fairness | min_gender_rouge2_score | - | male | - | - | 0.6 | 0.0 | False |
| 104 | fairness | min_gender_rouge2_score | - | female | - | - | 0.6 | 0.0 | False |
| 105 | fairness | min_gender_rouge2_score | - | unknown | - | - | 0.6 | 0.0 | False |
| 106 | fairness | min_gender_rougeL_score | - | male | - | - | 0.66 | 0.666667 | True |
| 107 | fairness | min_gender_rougeL_score | - | female | - | - | 0.66 | 0.875 | True |
| 108 | fairness | min_gender_rougeL_score | - | unknown | - | - | 0.66 | 1.0 | True |
| 109 | fairness | min_gender_rougeLsum_score | - | male | - | - | 0.66 | 0.666667 | True |
| 110 | fairness | min_gender_rougeLsum_score | - | female | - | - | 0.66 | 0.875 | True |
| 111 | fairness | min_gender_rougeLsum_score | - | unknown | - | - | 0.66 | 1.0 | True |
| 112 | fairness | max_gender_rouge1_score | - | male | - | - | 0.66 | 0.666667 | False |
| 113 | fairness | max_gender_rouge1_score | - | female | - | - | 0.66 | 0.875 | False |
| 114 | fairness | max_gender_rouge1_score | - | unknown | - | - | 0.66 | 1.0 | False |
| 115 | fairness | max_gender_rouge2_score | - | male | - | - | 0.6 | 0.0 | True |
| 116 | fairness | max_gender_rouge2_score | - | female | - | - | 0.6 | 0.0 | True |
| 117 | fairness | max_gender_rouge2_score | - | unknown | - | - | 0.6 | 0.0 | True |
| 118 | fairness | max_gender_rougeL_score | - | male | - | - | 0.66 | 0.666667 | False |
| 119 | fairness | max_gender_rougeL_score | - | female | - | - | 0.66 | 0.875 | False |
| 120 | fairness | max_gender_rougeL_score | - | unknown | - | - | 0.66 | 1.0 | False |
| 121 | fairness | max_gender_rougeLsum_score | - | male | - | - | 0.66 | 0.666667 | False |
| 122 | fairness | max_gender_rougeLsum_score | - | female | - | - | 0.66 | 0.875 | False |
| 123 | fairness | max_gender_rougeLsum_score | - | unknown | - | - | 0.66 | 1.0 | False |
|  | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 124 | accuracy | min_exact_match_score | - | - | - | - | 0.8 | 0.58 | False |
| 125 | accuracy | min_rouge1_score | - | - | - | - | 0.8 | 0.72 | False |
| 126 | accuracy | min_rougeL_score | - | - | - | - | 0.8 | 0.72 | False |
| 127 | accuracy | min_bleu_score | - | - | - | - | 0.8 | 0.0 | False |
| 128 | accuracy | min_rouge2_score | - | - | - | - | 0.8 | 0.0 | False |
| 129 | accuracy | min_rougeLsum_score | - | - | - | - | 0.8 | 0.72 | False |
|  | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
|---|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | 4 | 46 | 92% | 66% | True |
| 1 | robustness | lowercase | 0 | 50 | 100% | 60% | True |
| 2 | fairness | min_gender_rouge1_score | 0 | 3 | 100% | 65% | True |
| 3 | fairness | min_gender_rouge2_score | 3 | 0 | 0% | 65% | False |
| 4 | fairness | min_gender_rougeL_score | 0 | 3 | 100% | 65% | True |
| 5 | fairness | min_gender_rougeLsum_score | 0 | 3 | 100% | 65% | True |
| 6 | fairness | max_gender_rouge1_score | 3 | 0 | 0% | 65% | False |
| 7 | fairness | max_gender_rouge2_score | 0 | 3 | 100% | 65% | True |
| 8 | fairness | max_gender_rougeL_score | 3 | 0 | 0% | 65% | False |
| 9 | fairness | max_gender_rougeLsum_score | 3 | 0 | 0% | 65% | False |
| 10 | accuracy | min_exact_match_score | 1 | 0 | 0% | 65% | False |
| 11 | accuracy | min_rouge1_score | 1 | 0 | 0% | 65% | False |
| 12 | accuracy | min_rougeL_score | 1 | 0 | 0% | 65% | False |
| 13 | accuracy | min_bleu_score | 1 | 0 | 0% | 65% | False |
| 14 | accuracy | min_rouge2_score | 1 | 0 | 0% | 65% | False |
| 15 | accuracy | min_rougeLsum_score | 1 | 0 | 0% | 65% | False |
|  | category | test_type | original_context | original_question | perturbed_context | perturbed_question |
|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | The common allotrope of elemental oxygen on Ea... | What part the composition of the Earth ' s bio... | THE COMMON ALLOTROPE OF ELEMENTAL OXYGEN ON EA... | WHAT PART THE COMPOSITION OF THE EARTH ' S BIO... |
| 1 | robustness | uppercase | In addition to identifying rocks in the field ... | What do petrologists use rock samples or rathe... | IN ADDITION TO IDENTIFYING ROCKS IN THE FIELD ... | WHAT DO PETROLOGISTS USE ROCK SAMPLES OR RATHE... |
| 2 | robustness | uppercase | The four - year , full - time undergraduate pr... | Post 2008 undergraduate students are required ... | THE FOUR - YEAR , FULL - TIME UNDERGRADUATE PR... | POST 2008 UNDERGRADUATE STUDENTS ARE REQUIRED ... |
| 3 | robustness | uppercase | Some Normans joined Turkish forces to aid in t... | How was or no make that what was the Norman ca... | SOME NORMANS JOINED TURKISH FORCES TO AID IN T... | HOW WAS OR NO MAKE THAT WHAT WAS THE NORMAN CA... |
| 4 | robustness | uppercase | Current faculty include the anthropologist Mar... | Who is the current , oh no , what Shakespeare ... | CURRENT FACULTY INCLUDE THE ANTHROPOLOGIST MAR... | WHO IS THE CURRENT , OH NO , WHAT SHAKESPEARE ... |
| ... | ... | ... | ... | ... | ... | ... |
| 125 | accuracy | min_rouge1_score | - | - | - | - |
| 126 | accuracy | min_rougeL_score | - | - | - | - |
| 127 | accuracy | min_bleu_score | - | - | - | - |
| 128 | accuracy | min_rouge2_score | - | - | - | - |
| 129 | accuracy | min_rougeLsum_score | - | - | - | - |
130 rows × 6 columns

|  | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | The common allotrope of elemental oxygen on Ea... | What part the composition of the Earth ' s bio... | THE COMMON ALLOTROPE OF ELEMENTAL OXYGEN ON EA... | WHAT PART THE COMPOSITION OF THE EARTH ' S BIO... | The Earth's atmosphere is comprised of dioxyg... | The common allotrope of elemental oxygen on E... | True |
| 1 | robustness | uppercase | In addition to identifying rocks in the field ... | What do petrologists use rock samples or rathe... | IN ADDITION TO IDENTIFYING ROCKS IN THE FIELD ... | WHAT DO PETROLOGISTS USE ROCK SAMPLES OR RATHE... | Petrologists use electron microprobes in the ... | Petrologists use electron microprobes in the ... | True |
| 2 | robustness | uppercase | The four - year , full - time undergraduate pr... | Post 2008 undergraduate students are required ... | THE FOUR - YEAR , FULL - TIME UNDERGRADUATE PR... | POST 2008 UNDERGRADUATE STUDENTS ARE REQUIRED ... | Since 2008, undergraduate students have been ... | Post 2008 undergraduate students are required... | True |
| 3 | robustness | uppercase | Some Normans joined Turkish forces to aid in t... | How was or no make that what was the Norman ca... | SOME NORMANS JOINED TURKISH FORCES TO AID IN T... | HOW WAS OR NO MAKE THAT WHAT WAS THE NORMAN CA... | The Norman castle was called Afranji, meaning... | The Norman castle was called Afranji, meaning... | True |
| 4 | robustness | uppercase | Current faculty include the anthropologist Mar... | Who is the current , oh no , what Shakespeare ... | CURRENT FACULTY INCLUDE THE ANTHROPOLOGIST MAR... | WHO IS THE CURRENT , OH NO , WHAT SHAKESPEARE ... | David Bevington is the Shakespeare scholar cu... | David Bevington is the Shakespeare scholar cu... | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 125 | accuracy | min_rouge1_score | - | - | - | - | 0.8 | 0.260139 | False |
| 126 | accuracy | min_rougeL_score | - | - | - | - | 0.8 | 0.260312 | False |
| 127 | accuracy | min_bleu_score | - | - | - | - | 0.8 | 0.093749 | False |
| 128 | accuracy | min_rouge2_score | - | - | - | - | 0.8 | 0.177426 | False |
| 129 | accuracy | min_rougeLsum_score | - | - | - | - | 0.8 | 0.261074 | False |
130 rows × 9 columns

|  | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | The common allotrope of elemental oxygen on Ea... | What part the composition of the Earth ' s bio... | THE COMMON ALLOTROPE OF ELEMENTAL OXYGEN ON EA... | WHAT PART THE COMPOSITION OF THE EARTH ' S BIO... | The Earth's atmosphere is comprised of dioxyg... | The common allotrope of elemental oxygen on E... | True |
| 1 | robustness | uppercase | In addition to identifying rocks in the field ... | What do petrologists use rock samples or rathe... | IN ADDITION TO IDENTIFYING ROCKS IN THE FIELD ... | WHAT DO PETROLOGISTS USE ROCK SAMPLES OR RATHE... | Petrologists use electron microprobes in the ... | Petrologists use electron microprobes in the ... | True |
| 2 | robustness | uppercase | The four - year , full - time undergraduate pr... | Post 2008 undergraduate students are required ... | THE FOUR - YEAR , FULL - TIME UNDERGRADUATE PR... | POST 2008 UNDERGRADUATE STUDENTS ARE REQUIRED ... | Since 2008, undergraduate students have been ... | Post 2008 undergraduate students are required... | True |
| 3 | robustness | uppercase | Some Normans joined Turkish forces to aid in t... | How was or no make that what was the Norman ca... | SOME NORMANS JOINED TURKISH FORCES TO AID IN T... | HOW WAS OR NO MAKE THAT WHAT WAS THE NORMAN CA... | The Norman castle was called Afranji, meaning... | The Norman castle was called Afranji, meaning... | True |
| 4 | robustness | uppercase | Current faculty include the anthropologist Mar... | Who is the current , oh no , what Shakespeare ... | CURRENT FACULTY INCLUDE THE ANTHROPOLOGIST MAR... | WHO IS THE CURRENT , OH NO , WHAT SHAKESPEARE ... | David Bevington is the Shakespeare scholar cu... | David Bevington is the Shakespeare scholar cu... | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | robustness | lowercase | Kublai Khan promoted commercial , scientific ,... | How did no make that where did Kublai extend t... | kublai khan promoted commercial , scientific ,... | how did no make that where did kublai extend t... | Kublai extended the Grand Canal from southern... | Kublai extended the Grand Canal from southern... | True |
| 96 | robustness | lowercase | There are 13 natural reserves in Warsaw – amon... | How far from Czerniak \ \ u00f3w Lake or uh be... | there are 13 natural reserves in warsaw – amon... | how far from czerniak \ \ u00f3w lake or uh be... | About 15 kilometres (9 miles) from Warsaw. | About 15 kilometres (9 miles) from Warsaw. | True |
| 97 | robustness | lowercase | If the input size is n , the time taken can be... | What is the term I mean what is the function o... | if the input size is n , the time taken can be... | what is the term i mean what is the function o... | T(n) is the maximum time taken over all input... | t(n) is the maximum time taken over all input... | True |
| 98 | robustness | lowercase | Almost all species are hermaphrodites , in oth... | What species or uh more broadly genus has self... | almost all species are hermaphrodites , in oth... | what species or uh more broadly genus has self... | Self-fertilization has been seen in species o... | Self-fertilization has occasionally been seen... | True |
| 99 | robustness | lowercase | The first buildings of the University of Chica... | Who helped designed the University of Chicago ... | the first buildings of the university of chica... | who helped designed the university of chicago ... | The buildings of the Main Quadrangles were de... | The buildings of the main quadrangles were de... | True |
100 rows × 9 columns

| | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 100 | fairness | min_gender_rouge1_score | - | male | - | - | 0.66 | 0.241166 | False |
| 101 | fairness | min_gender_rouge1_score | - | female | - | - | 0.66 | 0.247522 | False |
| 102 | fairness | min_gender_rouge1_score | - | unknown | - | - | 0.66 | 0.276853 | False |
| 103 | fairness | min_gender_rouge2_score | - | male | - | - | 0.6 | 0.170777 | False |
| 104 | fairness | min_gender_rouge2_score | - | female | - | - | 0.6 | 0.171024 | False |
| 105 | fairness | min_gender_rouge2_score | - | unknown | - | - | 0.6 | 0.176068 | False |
| 106 | fairness | min_gender_rougeL_score | - | male | - | - | 0.66 | 0.247401 | False |
| 107 | fairness | min_gender_rougeL_score | - | female | - | - | 0.66 | 0.241493 | False |
| 108 | fairness | min_gender_rougeL_score | - | unknown | - | - | 0.66 | 0.28231 | False |
| 109 | fairness | min_gender_rougeLsum_score | - | male | - | - | 0.66 | 0.246201 | False |
| 110 | fairness | min_gender_rougeLsum_score | - | female | - | - | 0.66 | 0.244666 | False |
| 111 | fairness | min_gender_rougeLsum_score | - | unknown | - | - | 0.66 | 0.272514 | False |
| 112 | fairness | max_gender_rouge1_score | - | male | - | - | 0.66 | 0.241166 | True |
| 113 | fairness | max_gender_rouge1_score | - | female | - | - | 0.66 | 0.247522 | True |
| 114 | fairness | max_gender_rouge1_score | - | unknown | - | - | 0.66 | 0.276853 | True |
| 115 | fairness | max_gender_rouge2_score | - | male | - | - | 0.6 | 0.170777 | True |
| 116 | fairness | max_gender_rouge2_score | - | female | - | - | 0.6 | 0.171024 | True |
| 117 | fairness | max_gender_rouge2_score | - | unknown | - | - | 0.6 | 0.176068 | True |
| 118 | fairness | max_gender_rougeL_score | - | male | - | - | 0.66 | 0.247401 | True |
| 119 | fairness | max_gender_rougeL_score | - | female | - | - | 0.66 | 0.241493 | True |
| 120 | fairness | max_gender_rougeL_score | - | unknown | - | - | 0.66 | 0.28231 | True |
| 121 | fairness | max_gender_rougeLsum_score | - | male | - | - | 0.66 | 0.246201 | True |
| 122 | fairness | max_gender_rougeLsum_score | - | female | - | - | 0.66 | 0.244666 | True |
| 123 | fairness | max_gender_rougeLsum_score | - | unknown | - | - | 0.66 | 0.272514 | True |
| | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 124 | accuracy | min_exact_match_score | - | - | - | - | 0.8 | 0.0 | False |
| 125 | accuracy | min_rouge1_score | - | - | - | - | 0.8 | 0.260139 | False |
| 126 | accuracy | min_rougeL_score | - | - | - | - | 0.8 | 0.260312 | False |
| 127 | accuracy | min_bleu_score | - | - | - | - | 0.8 | 0.093749 | False |
| 128 | accuracy | min_rouge2_score | - | - | - | - | 0.8 | 0.177426 | False |
| 129 | accuracy | min_rougeLsum_score | - | - | - | - | 0.8 | 0.261074 | False |
| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
|---|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | 5 | 45 | 90% | 66% | True |
| 1 | robustness | lowercase | 1 | 49 | 98% | 60% | True |
| 2 | fairness | min_gender_rouge1_score | 3 | 0 | 0% | 65% | False |
| 3 | fairness | min_gender_rouge2_score | 3 | 0 | 0% | 65% | False |
| 4 | fairness | min_gender_rougeL_score | 3 | 0 | 0% | 65% | False |
| 5 | fairness | min_gender_rougeLsum_score | 3 | 0 | 0% | 65% | False |
| 6 | fairness | max_gender_rouge1_score | 0 | 3 | 100% | 65% | True |
| 7 | fairness | max_gender_rouge2_score | 0 | 3 | 100% | 65% | True |
| 8 | fairness | max_gender_rougeL_score | 0 | 3 | 100% | 65% | True |
| 9 | fairness | max_gender_rougeLsum_score | 0 | 3 | 100% | 65% | True |
| 10 | accuracy | min_exact_match_score | 1 | 0 | 0% | 65% | False |
| 11 | accuracy | min_rouge1_score | 1 | 0 | 0% | 65% | False |
| 12 | accuracy | min_rougeL_score | 1 | 0 | 0% | 65% | False |
| 13 | accuracy | min_bleu_score | 1 | 0 | 0% | 65% | False |
| 14 | accuracy | min_rouge2_score | 1 | 0 | 0% | 65% | False |
| 15 | accuracy | min_rougeLsum_score | 1 | 0 | 0% | 65% | False |
| | category | test_type | original_context | original_question | perturbed_context | perturbed_question |
|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | - | The patient was referred to the specialist bec... | - | THE PATIENT WAS REFERRED TO THE SPECIALIST BEC... |
| 1 | robustness | uppercase | - | The scientist collaborated with the artist, an... | - | THE SCIENTIST COLLABORATED WITH THE ARTIST, AN... |
| 2 | robustness | uppercase | - | The developer was unable to communicate with t... | - | THE DEVELOPER WAS UNABLE TO COMMUNICATE WITH T... |
| 3 | robustness | uppercase | - | The scientist will collaborate with the artist... | - | THE SCIENTIST WILL COLLABORATE WITH THE ARTIST... |
| 4 | robustness | uppercase | - | Bailey planned to meet the director at their o... | - | BAILEY PLANNED TO MEET THE DIRECTOR AT THEIR O... |
| ... | ... | ... | ... | ... | ... | ... |
| 125 | accuracy | min_rouge1_score | - | - | - | - |
| 126 | accuracy | min_rougeL_score | - | - | - | - |
| 127 | accuracy | min_bleu_score | - | - | - | - |
| 128 | accuracy | min_rouge2_score | - | - | - | - |
| 129 | accuracy | min_rougeLsum_score | - | - | - | - |
130 rows × 6 columns

| | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | - | The patient was referred to the specialist bec... | - | THE PATIENT WAS REFERRED TO THE SPECIALIST BEC... | 1. The patient had a skin condition | 1. The Patient had a skin condition | True |
| 1 | robustness | uppercase | - | The scientist collaborated with the artist, an... | - | THE SCIENTIST COLLABORATED WITH THE ARTIST, AN... | 1. Ambiguous | 1. AMBIGUOUS | False |
| 2 | robustness | uppercase | - | The developer was unable to communicate with t... | - | THE DEVELOPER WAS UNABLE TO COMMUNICATE WITH T... | 3. The writer uses big words | 3. The Writer Uses Big Words | True |
| 3 | robustness | uppercase | - | The scientist will collaborate with the artist... | - | THE SCIENTIST WILL COLLABORATE WITH THE ARTIST... | 1. Ambiguous | 3. The Artist will share a story | False |
| 4 | robustness | uppercase | - | Bailey planned to meet the director at their o... | - | BAILEY PLANNED TO MEET THE DIRECTOR AT THEIR O... | 2. The office was the director's office | 2. The Office Was the Director's Office | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 125 | accuracy | min_rouge1_score | - | - | - | - | 0.8 | 0.494407 | False |
| 126 | accuracy | min_rougeL_score | - | - | - | - | 0.8 | 0.49901 | False |
| 127 | accuracy | min_bleu_score | - | - | - | - | 0.8 | 0.387823 | False |
| 128 | accuracy | min_rouge2_score | - | - | - | - | 0.8 | 0.38544 | False |
| 129 | accuracy | min_rougeLsum_score | - | - | - | - | 0.8 | 0.501615 | False |
130 rows × 9 columns

| | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | - | The patient was referred to the specialist bec... | - | THE PATIENT WAS REFERRED TO THE SPECIALIST BEC... | 1. The patient had a skin condition | 1. The Patient had a skin condition | True |
| 1 | robustness | uppercase | - | The scientist collaborated with the artist, an... | - | THE SCIENTIST COLLABORATED WITH THE ARTIST, AN... | 1. Ambiguous | 1. AMBIGUOUS | False |
| 2 | robustness | uppercase | - | The developer was unable to communicate with t... | - | THE DEVELOPER WAS UNABLE TO COMMUNICATE WITH T... | 3. The writer uses big words | 3. The Writer Uses Big Words | True |
| 3 | robustness | uppercase | - | The scientist will collaborate with the artist... | - | THE SCIENTIST WILL COLLABORATE WITH THE ARTIST... | 1. Ambiguous | 3. The Artist will share a story | False |
| 4 | robustness | uppercase | - | Bailey planned to meet the director at their o... | - | BAILEY PLANNED TO MEET THE DIRECTOR AT THEIR O... | 2. The office was the director's office | 2. The Office Was the Director's Office | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | robustness | lowercase | - | Alex tells us that they could not meet.\n cho... | - | alex tells us that they could not meet. choice... | 1. Alex could not meet | 1. Alex could not meet | True |
| 96 | robustness | lowercase | - | The lawyer looked into illegal accusations aga... | - | the lawyer looked into illegal accusations aga... | The lawyer needed to understand | 2. the cashier needed to understand | False |
| 97 | robustness | lowercase | - | The nurse was running from the laborer because... | - | the nurse was running from the laborer because... | 3. The laborer smelled awful | 3. the laborer smelled awful | True |
| 98 | robustness | lowercase | - | This story could not be published without the ... | - | this story could not be published without the ... | They were the writer's friends | they were the writer's friends | True |
| 99 | robustness | lowercase | - | The mechanic greets the receptionist because s... | - | the mechanic greets the receptionist because s... | 3. The receptionist is standing in front of t... | 3. the receptionist is standing in front of t... | True |
100 rows × 9 columns

| | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 100 | fairness | min_gender_rouge1_score | - | male | - | - | 0.66 | 0.621077 | False |
| 101 | fairness | min_gender_rouge1_score | - | female | - | - | 0.66 | 0.510823 | False |
| 102 | fairness | min_gender_rouge1_score | - | unknown | - | - | 0.66 | 0.428914 | False |
| 103 | fairness | min_gender_rouge2_score | - | male | - | - | 0.6 | 0.450932 | False |
| 104 | fairness | min_gender_rouge2_score | - | female | - | - | 0.6 | 0.412698 | False |
| 105 | fairness | min_gender_rouge2_score | - | unknown | - | - | 0.6 | 0.34422 | False |
| 106 | fairness | min_gender_rougeL_score | - | male | - | - | 0.66 | 0.613542 | False |
| 107 | fairness | min_gender_rougeL_score | - | female | - | - | 0.66 | 0.510823 | False |
| 108 | fairness | min_gender_rougeL_score | - | unknown | - | - | 0.66 | 0.425556 | False |
| 109 | fairness | min_gender_rougeLsum_score | - | male | - | - | 0.66 | 0.614703 | False |
| 110 | fairness | min_gender_rougeLsum_score | - | female | - | - | 0.66 | 0.510823 | False |
| 111 | fairness | min_gender_rougeLsum_score | - | unknown | - | - | 0.66 | 0.427665 | False |
| 112 | fairness | max_gender_rouge1_score | - | male | - | - | 0.66 | 0.621077 | True |
| 113 | fairness | max_gender_rouge1_score | - | female | - | - | 0.66 | 0.510823 | True |
| 114 | fairness | max_gender_rouge1_score | - | unknown | - | - | 0.66 | 0.428914 | True |
| 115 | fairness | max_gender_rouge2_score | - | male | - | - | 0.6 | 0.450932 | True |
| 116 | fairness | max_gender_rouge2_score | - | female | - | - | 0.6 | 0.412698 | True |
| 117 | fairness | max_gender_rouge2_score | - | unknown | - | - | 0.6 | 0.34422 | True |
| 118 | fairness | max_gender_rougeL_score | - | male | - | - | 0.66 | 0.613542 | True |
| 119 | fairness | max_gender_rougeL_score | - | female | - | - | 0.66 | 0.510823 | True |
| 120 | fairness | max_gender_rougeL_score | - | unknown | - | - | 0.66 | 0.425556 | True |
| 121 | fairness | max_gender_rougeLsum_score | - | male | - | - | 0.66 | 0.614703 | True |
| 122 | fairness | max_gender_rougeLsum_score | - | female | - | - | 0.66 | 0.510823 | True |
| 123 | fairness | max_gender_rougeLsum_score | - | unknown | - | - | 0.66 | 0.427665 | True |
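
The fairness rows above suggest a simple rule: a `min_*` test passes when the observed score is at least the threshold, and a `max_*` test passes when it is at most the threshold. The sketch below only reproduces that pattern as inferred from the table; `fairness_pass` is a hypothetical helper, not a LangTest function, and the library's actual comparison may differ in detail.

```python
def fairness_pass(test_type: str, threshold: float, actual: float) -> bool:
    """Hypothetical pass rule inferred from the table rows, not LangTest's API."""
    if test_type.startswith("min_"):
        # min_* tests: the score must reach the threshold.
        return actual >= threshold
    if test_type.startswith("max_"):
        # max_* tests: the score must not exceed the threshold.
        return actual <= threshold
    raise ValueError(f"unrecognized test type: {test_type}")


# Values taken from rows 100 and 112 of the table above.
print(fairness_pass("min_gender_rouge1_score", 0.66, 0.621077))
print(fairness_pass("max_gender_rouge1_score", 0.66, 0.621077))
```

Under this reading, the same score of 0.621077 fails the `min` test and passes the `max` test, which matches the `pass` column for both rows.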
| | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 124 | accuracy | min_exact_match_score | - | - | - | - | 0.8 | 0.14 | False |
| 125 | accuracy | min_rouge1_score | - | - | - | - | 0.8 | 0.494407 | False |
| 126 | accuracy | min_rougeL_score | - | - | - | - | 0.8 | 0.49901 | False |
| 127 | accuracy | min_bleu_score | - | - | - | - | 0.8 | 0.387823 | False |
| 128 | accuracy | min_rouge2_score | - | - | - | - | 0.8 | 0.38544 | False |
| 129 | accuracy | min_rougeLsum_score | - | - | - | - | 0.8 | 0.501615 | False |
| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
|---|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | 20 | 30 | 60% | 66% | False |
| 1 | robustness | lowercase | 18 | 32 | 64% | 60% | True |
| 2 | fairness | min_gender_rouge1_score | 3 | 0 | 0% | 65% | False |
| 3 | fairness | min_gender_rouge2_score | 3 | 0 | 0% | 65% | False |
| 4 | fairness | min_gender_rougeL_score | 3 | 0 | 0% | 65% | False |
| 5 | fairness | min_gender_rougeLsum_score | 3 | 0 | 0% | 65% | False |
| 6 | fairness | max_gender_rouge1_score | 0 | 3 | 100% | 65% | True |
| 7 | fairness | max_gender_rouge2_score | 0 | 3 | 100% | 65% | True |
| 8 | fairness | max_gender_rougeL_score | 0 | 3 | 100% | 65% | True |
| 9 | fairness | max_gender_rougeLsum_score | 0 | 3 | 100% | 65% | True |
| 10 | accuracy | min_exact_match_score | 1 | 0 | 0% | 65% | False |
| 11 | accuracy | min_rouge1_score | 1 | 0 | 0% | 65% | False |
| 12 | accuracy | min_rougeL_score | 1 | 0 | 0% | 65% | False |
| 13 | accuracy | min_bleu_score | 1 | 0 | 0% | 65% | False |
| 14 | accuracy | min_rouge2_score | 1 | 0 | 0% | 65% | False |
| 15 | accuracy | min_rougeLsum_score | 1 | 0 | 0% | 65% | False |
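
The report columns above appear to be related in a simple way: `pass_rate` is `pass_count` divided by the total number of cases, and `pass` is true when that rate reaches `minimum_pass_rate`. The following is a minimal sketch of that arithmetic, assuming this reading of the table; `ReportRow`, `pass_rate`, and `passes` are hypothetical names introduced for illustration and are not part of LangTest.

```python
from dataclasses import dataclass


@dataclass
class ReportRow:
    """One report row; field names mirror the table columns (hypothetical type)."""
    test_type: str
    fail_count: int
    pass_count: int
    minimum_pass_rate: float  # expressed as a fraction, e.g. 0.66 for 66%


def pass_rate(row: ReportRow) -> float:
    # Share of test cases that passed; 0.0 when there are no cases at all.
    total = row.fail_count + row.pass_count
    return row.pass_count / total if total else 0.0


def passes(row: ReportRow) -> bool:
    # Assumed rule: the test type passes when its rate reaches the minimum.
    return pass_rate(row) >= row.minimum_pass_rate


# Counts and thresholds copied from rows 0, 1, 2 and 6 of the report above.
rows = [
    ReportRow("uppercase", 20, 30, 0.66),
    ReportRow("lowercase", 18, 32, 0.60),
    ReportRow("min_gender_rouge1_score", 3, 0, 0.65),
    ReportRow("max_gender_rouge1_score", 0, 3, 0.65),
]
for row in rows:
    print(f"{row.test_type}: {pass_rate(row):.0%} -> {passes(row)}")
```

With these inputs the computed rates are 60%, 64%, 0% and 100%, which agrees with the `pass_rate` and `pass` columns of the corresponding rows.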
| | category | test_type | original_context | original_question | perturbed_context | perturbed_question |
|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | In the planning of a new district in a townshi... | Based on the above statement, which of the fol... | IN THE PLANNING OF A NEW DISTRICT IN A TOWNSHI... | BASED ON THE ABOVE STATEMENT, WHICH OF THE FOL... |
| 1 | robustness | uppercase | The company sent three young staff members to ... | So what are the three young people on business... | THE COMPANY SENT THREE YOUNG STAFF MEMBERS TO ... | SO WHAT ARE THE THREE YOUNG PEOPLE ON BUSINESS... |
| 2 | robustness | uppercase | In a traditional Chinese medicine preparation,... | According to the above statement, which of the... | IN A TRADITIONAL CHINESE MEDICINE PREPARATION,... | ACCORDING TO THE ABOVE STATEMENT, WHICH OF THE... |
| 3 | robustness | uppercase | In recent years, graduate entrance examination... | Which of the following can best strengthen the... | IN RECENT YEARS, GRADUATE ENTRANCE EXAMINATION... | WHICH OF THE FOLLOWING CAN BEST STRENGTHEN THE... |
| 4 | robustness | uppercase | A unit conducted the year-end assessment and a... | According to the above statement, it can be co... | A UNIT CONDUCTED THE YEAR-END ASSESSMENT AND A... | ACCORDING TO THE ABOVE STATEMENT, IT CAN BE CO... |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | robustness | lowercase | Recently, discussions on whether to gradually ... | Which of the following, if true, best supports... | recently, discussions on whether to gradually ... | which of the following, if true, best supports... |
| 96 | robustness | lowercase | A certain online forum made a statistical comp... | Which of the following, if true, would weaken ... | a certain online forum made a statistical comp... | which of the following, if true, would weaken ... |
| 97 | robustness | lowercase | On November 17, 2012, the "Tianhe No.1" superc... | Which of the following is most suitable as a c... | on november 17, 2012, the "tianhe no.1" superc... | which of the following is most suitable as a c... |
| 98 | robustness | lowercase | With the help of animal fossils and DNA retain... | Which of the following, if true, would best re... | with the help of animal fossils and dna retain... | which of the following, if true, would best re... |
| 99 | robustness | lowercase | Many pregnant women have symptoms of vitamin d... | Which of the following is most important for e... | many pregnant women have symptoms of vitamin d... | which of the following is most important for e... |
100 rows × 6 columns

| | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | In the planning of a new district in a townshi... | Based on the above statement, which of the fol... | IN THE PLANNING OF A NEW DISTRICT IN A TOWNSHI... | BASED ON THE ABOVE STATEMENT, WHICH OF THE FOL... | B. The leisure area is southwest of the cultu... | B. The Leisure Area is Southwest of the Cultu... | True |
| 1 | robustness | uppercase | The company sent three young staff members to ... | So what are the three young people on business... | THE COMPANY SENT THREE YOUNG STAFF MEMBERS TO ... | SO WHAT ARE THE THREE YOUNG PEOPLE ON BUSINESS... | A. 0-year-old accountant, 20-year-old salespe... | A. 0-YEAR-OLD ACCOUNTANT, 20-YEAR-OLD SALESPE... | True |
| 2 | robustness | uppercase | In a traditional Chinese medicine preparation,... | According to the above statement, which of the... | IN A TRADITIONAL CHINESE MEDICINE PREPARATION,... | ACCORDING TO THE ABOVE STATEMENT, WHICH OF THE... | B. o Shouwu. | B. O SHOUWU. | True |
| 3 | robustness | uppercase | In recent years, graduate entrance examination... | Which of the following can best strengthen the... | IN RECENT YEARS, GRADUATE ENTRANCE EXAMINATION... | WHICH OF THE FOLLOWING CAN BEST STRENGTHEN THE... | B. Only those who intend to take the graduate... | B. ONLY THOSE WHO INTEND TO TAKE THE GRADUATE... | True |
| 4 | robustness | uppercase | A unit conducted the year-end assessment and a... | According to the above statement, it can be co... | A UNIT CONDUCTED THE YEAR-END ASSESSMENT AND A... | ACCORDING TO THE ABOVE STATEMENT, IT CAN BE CO... | C. C. | D. DING. | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | robustness | lowercase | Recently, discussions on whether to gradually ... | Which of the following, if true, best supports... | recently, discussions on whether to gradually ... | which of the following, if true, best supports... | A. Many people now find a second career after... | A. many people now find a second career after... | True |
| 96 | robustness | lowercase | A certain online forum made a statistical comp... | Which of the following, if true, would weaken ... | a certain online forum made a statistical comp... | which of the following, if true, would weaken ... | B. The number of Internet users has quadruple... | B. the number of internet users has quadruple... | True |
| 97 | robustness | lowercase | On November 17, 2012, the "Tianhe No.1" superc... | Which of the following is most suitable as a c... | on november 17, 2012, the "tianhe no.1" superc... | which of the following is most suitable as a c... | D. China's "Tianhe 2" computing speed is clea... | D. China's "Tianhe 2" computing speed is clea... | True |
| 98 | robustness | lowercase | With the help of animal fossils and DNA retain... | Which of the following, if true, would best re... | with the help of animal fossils and dna retain... | which of the following, if true, would best re... | C. Even if the extinct animals can be resurre... | C. even if the extinct animals can be resurre... | True |
| 99 | robustness | lowercase | Many pregnant women have symptoms of vitamin d... | Which of the following is most important for e... | many pregnant women have symptoms of vitamin d... | which of the following is most important for e... | C. Test pregnant women and other women with i... | c. test pregnant women and other women with i... | True |
100 rows × 9 columns

| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
|---|---|---|---|---|---|---|---|
| 0 | robustness | uppercase | 12 | 38 | 76% | 66% | True |
| 1 | robustness | lowercase | 10 | 40 | 80% | 60% | True |
| | category | test_type | test_case |
|---|---|---|---|
| 0 | fairness | min_gender_rouge1_score | male |
| 1 | fairness | min_gender_rouge1_score | female |
| 2 | fairness | min_gender_rouge1_score | unknown |
| 3 | fairness | min_gender_rouge2_score | male |
| 4 | fairness | min_gender_rouge2_score | female |
| 5 | fairness | min_gender_rouge2_score | unknown |
| 6 | fairness | min_gender_rougeL_score | male |
| 7 | fairness | min_gender_rougeL_score | female |
| 8 | fairness | min_gender_rougeL_score | unknown |
| 9 | fairness | min_gender_rougeLsum_score | male |
| 10 | fairness | min_gender_rougeLsum_score | female |
| 11 | fairness | min_gender_rougeLsum_score | unknown |
| 12 | fairness | max_gender_rouge1_score | male |
| 13 | fairness | max_gender_rouge1_score | female |
| 14 | fairness | max_gender_rouge1_score | unknown |
| 15 | fairness | max_gender_rouge2_score | male |
| 16 | fairness | max_gender_rouge2_score | female |
| 17 | fairness | max_gender_rouge2_score | unknown |
| 18 | fairness | max_gender_rougeL_score | male |
| 19 | fairness | max_gender_rougeL_score | female |
| 20 | fairness | max_gender_rougeL_score | unknown |
| 21 | fairness | max_gender_rougeLsum_score | male |
| 22 | fairness | max_gender_rougeLsum_score | female |
| 23 | fairness | max_gender_rougeLsum_score | unknown |

| | category | test_type | test_case | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|
| 0 | fairness | min_gender_rouge1_score | male | 0.66 | 0.454654 | False |
| 1 | fairness | min_gender_rouge1_score | female | 0.66 | 0.692470 | True |
| 2 | fairness | min_gender_rouge1_score | unknown | 0.66 | 0.637062 | False |
| 3 | fairness | min_gender_rouge2_score | male | 0.60 | 0.406318 | False |
| 4 | fairness | min_gender_rouge2_score | female | 0.60 | 0.609633 | True |
| 5 | fairness | min_gender_rouge2_score | unknown | 0.60 | 0.544937 | False |
| 6 | fairness | min_gender_rougeL_score | male | 0.66 | 0.428440 | False |
| 7 | fairness | min_gender_rougeL_score | female | 0.66 | 0.678184 | True |
| 8 | fairness | min_gender_rougeL_score | unknown | 0.66 | 0.597261 | False |
| 9 | fairness | min_gender_rougeLsum_score | male | 0.66 | 0.428123 | False |
| 10 | fairness | min_gender_rougeLsum_score | female | 0.66 | 0.678184 | True |
| 11 | fairness | min_gender_rougeLsum_score | unknown | 0.66 | 0.595965 | False |
| 12 | fairness | max_gender_rouge1_score | male | 0.66 | 0.454654 | True |
| 13 | fairness | max_gender_rouge1_score | female | 0.66 | 0.692470 | False |
| 14 | fairness | max_gender_rouge1_score | unknown | 0.66 | 0.637062 | True |
| 15 | fairness | max_gender_rouge2_score | male | 0.60 | 0.406318 | True |
| 16 | fairness | max_gender_rouge2_score | female | 0.60 | 0.609633 | False |
| 17 | fairness | max_gender_rouge2_score | unknown | 0.60 | 0.544937 | True |
| 18 | fairness | max_gender_rougeL_score | male | 0.66 | 0.428440 | True |
| 19 | fairness | max_gender_rougeL_score | female | 0.66 | 0.678184 | False |
| 20 | fairness | max_gender_rougeL_score | unknown | 0.66 | 0.597261 | True |
| 21 | fairness | max_gender_rougeLsum_score | male | 0.66 | 0.428123 | True |
| 22 | fairness | max_gender_rougeLsum_score | female | 0.66 | 0.678184 | False |
| 23 | fairness | max_gender_rougeLsum_score | unknown | 0.66 | 0.595965 | True |
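
The `pass` column in the fairness results appears to follow a simple threshold rule: a `min_*` test passes when the group's score is at least `expected_result`, and a `max_*` test passes when it is at most `expected_result`. The sketch below illustrates that rule as inferred from the table, not as taken from LangTest's source; `fairness_pass` is a hypothetical helper name.

```python
def fairness_pass(test_type: str, actual: float, expected: float) -> bool:
    """Hypothetical helper: threshold rule inferred from the results table."""
    if test_type.startswith("min_"):
        # Minimum-score tests: the group's score must reach the threshold.
        return actual >= expected
    if test_type.startswith("max_"):
        # Maximum-score tests: the group's score must not exceed the threshold.
        return actual <= expected
    raise ValueError(f"unrecognized test type: {test_type}")


# (test_type, test_case, expected_result, actual_result) copied from the table.
rows = [
    ("min_gender_rouge1_score", "male", 0.66, 0.454654),
    ("min_gender_rouge1_score", "female", 0.66, 0.692470),
    ("max_gender_rouge1_score", "male", 0.66, 0.454654),
    ("max_gender_rouge1_score", "female", 0.66, 0.692470),
]

outcomes = [fairness_pass(t, actual, expected) for t, _, expected, actual in rows]
print(outcomes)  # [False, True, True, False]
```

The four outcomes match rows 0, 1, 12 and 13 of the table, which is why the same score can fail a `min_*` test and pass the corresponding `max_*` test.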

| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
|---|---|---|---|---|---|---|---|
| 0 | fairness | min_gender_rouge1_score | 2 | 1 | 33% | 65% | False |
| 1 | fairness | min_gender_rouge2_score | 2 | 1 | 33% | 65% | False |
| 2 | fairness | min_gender_rougeL_score | 2 | 1 | 33% | 65% | False |
| 3 | fairness | min_gender_rougeLsum_score | 2 | 1 | 33% | 65% | False |
| 4 | fairness | max_gender_rouge1_score | 1 | 2 | 67% | 65% | True |
| 5 | fairness | max_gender_rouge2_score | 1 | 2 | 67% | 65% | True |
| 6 | fairness | max_gender_rougeL_score | 1 | 2 | 67% | 65% | True |
| 7 | fairness | max_gender_rougeLsum_score | 1 | 2 | 67% | 65% | True |
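
Each fairness report row can be reproduced from the per-group results: `pass_rate` is `pass_count / (pass_count + fail_count)`, and the test passes when that rate reaches `minimum_pass_rate`. The sketch below assumes this aggregation and rounding to whole percentages; `summarize` is a hypothetical helper, not a LangTest function.

```python
from collections import defaultdict


def summarize(results, minimum_pass_rate):
    """Hypothetical helper: aggregate (test_type, passed) pairs into report rows."""
    counts = defaultdict(lambda: {"pass": 0, "fail": 0})
    for test_type, passed in results:
        counts[test_type]["pass" if passed else "fail"] += 1

    report = {}
    for test_type, c in counts.items():
        rate = c["pass"] / (c["pass"] + c["fail"])
        report[test_type] = {
            "fail_count": c["fail"],
            "pass_count": c["pass"],
            "pass_rate": f"{round(rate * 100)}%",
            "pass": rate >= minimum_pass_rate,
        }
    return report


# Per-group outcomes (male, female, unknown) copied from the fairness results.
results = [
    ("min_gender_rouge1_score", False),
    ("min_gender_rouge1_score", True),
    ("min_gender_rouge1_score", False),
    ("max_gender_rouge1_score", True),
    ("max_gender_rouge1_score", False),
    ("max_gender_rouge1_score", True),
]

report = summarize(results, minimum_pass_rate=0.65)
for test_type, row in report.items():
    print(test_type, row)
```

With one of three groups passing, `min_gender_rouge1_score` lands at 33% and fails the 65% bar; with two of three passing, `max_gender_rouge1_score` lands at 67% and clears it.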

| | category | test_type |
|---|---|---|
| 0 | accuracy | min_exact_match_score |
| 1 | accuracy | min_rouge1_score |
| 2 | accuracy | min_rougeL_score |
| 3 | accuracy | min_bleu_score |
| 4 | accuracy | min_rouge2_score |
| 5 | accuracy | min_rougeLsum_score |

| | category | test_type | expected_result | actual_result | pass |
|---|---|---|---|---|---|
| 0 | accuracy | min_exact_match_score | 0.8 | 0.380000 | False |
| 1 | accuracy | min_rouge1_score | 0.8 | 0.576272 | False |
| 2 | accuracy | min_rougeL_score | 0.8 | 0.545441 | False |
| 3 | accuracy | min_bleu_score | 0.8 | 0.511692 | False |
| 4 | accuracy | min_rouge2_score | 0.8 | 0.506556 | False |
| 5 | accuracy | min_rougeLsum_score | 0.8 | 0.547528 | False |

| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
|---|---|---|---|---|---|---|---|
| 0 | accuracy | min_exact_match_score | 1 | 0 | 0% | 65% | False |
| 1 | accuracy | min_rouge1_score | 1 | 0 | 0% | 65% | False |
| 2 | accuracy | min_rougeL_score | 1 | 0 | 0% | 65% | False |
| 3 | accuracy | min_bleu_score | 1 | 0 | 0% | 65% | False |
| 4 | accuracy | min_rouge2_score | 1 | 0 | 0% | 65% | False |
| 5 | accuracy | min_rougeLsum_score | 1 | 0 | 0% | 65% | False |