Add GPL adaptation tutorial #2632
vblagoje merged 3 commits into deepset-ai:master from vblagoje:gpl_tutorial
Conversation
TuanaCelik left a comment
Hey @vblagoje - This looks good. A few things I would do:
- Some hand-holding maybe: an explanation of what Generative Pseudo Labeling is, and short descriptions of what each cell block is about to do just above it (you already have this for some of them)
- I think we may have to add this to the headers here but I will check this.
- I would maybe include the full name 'Generative Pseudo Labeling' in the title(s) too. If I understand the tutorial correctly something like this might fit wdyt?: "Generative Pseudo Labeling for Domain Adaptation of Dense Retrieval" or "Domain Adaptation of Dense Retrieval with GPL" (fair if you think it's too long)
All good points @TuanaCelik. Will make the recommended changes.
How about:
The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing events, like the COVID pandemic, and we'd like our models to know about them. Training a model from scratch is tedious work, but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.
The example below shows you how to use GPL to fine-tune a model so that it can answer the query: "How is COVID-19 transmitted?".
We're using TAS-B: a DistilBERT model...
Both DistilBERT and MS MARCO were trained on data from 2018 and before, so they don't have any COVID-related information.
For this example, we're using just four documents. When you ask the model "How is COVID-19 transmitted?", here are the answers that you get (dot score and document):
You can see that the correct document is only third, outranked by Ebola and HIV information. Let's see how we can make this better.
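The ranking step described here can be sketched in a few lines. This is a toy illustration, not the tutorial's code: the real notebook embeds the query and documents with TAS-B, while the vectors below are invented so that the resulting order matches the one observed (Ebola and HIV outrank COVID before fine-tuning).

```python
# Toy dot-score ranking. The tutorial uses TAS-B embeddings; these
# 3-dimensional vectors are invented purely to illustrate the step.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query_emb = [0.1, 0.9, 0.2]
doc_embs = {
    "Ebola transmission": [0.20, 0.95, 0.10],
    "HIV transmission": [0.15, 0.90, 0.15],
    "COVID-19 transmission": [0.05, 0.80, 0.30],
}

# Dense retrievers rank documents by the dot product between the
# query embedding and each document embedding.
scores = {name: dot(query_emb, emb) for name, emb in doc_embs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
# Before adaptation, the COVID document comes last in this toy setup.
```

With these made-up vectors the COVID document scores lowest, mirroring the tutorial's observation that it is outranked before fine-tuning.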
I think the # before "run" is not needed here.
...PseudoLabelGenerator, (comma)
... data. (full stop)
Verify that EmbeddingRetriever is adapted and save it for future use
Let's repeat our query to see if the Retriever learned about COVID and can now rank it as #1 among the answers.
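That verification step can be mocked up the same way: re-rank the documents and confirm the COVID passage moved to the top. Everything below is invented toy data standing in for the original and GPL-adapted retriever embeddings; it only illustrates the expected effect, not Haystack's API.

```python
# Toy before/after rank check. The embeddings are invented: "after"
# simulates the GPL-adapted retriever, in which the COVID passage
# embedding has moved toward COVID-related queries.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_of(target, query, docs):
    """1-based rank of `target` when docs are sorted by dot score."""
    ordered = sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)
    return ordered.index(target) + 1

query = [0.1, 0.9, 0.2]
before = {
    "Ebola": [0.20, 0.95, 0.10],
    "HIV": [0.15, 0.90, 0.15],
    "COVID-19": [0.05, 0.80, 0.30],
}
# Simulated effect of fine-tuning: only the COVID embedding changes.
after = dict(before, **{"COVID-19": [0.12, 0.95, 0.35]})
```

In this sketch the COVID document ranks third before adaptation and first after, which is exactly the outcome the comment suggests checking for.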
@agnieszka-m would you please confirm your recommended changes are included? TY
Seems like it is good to go now @agnieszka-m
Current: The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing evens, like the COVID pandemics, and we'd like our models to know about them. Training a model from scratch is tedious work but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.
Suggested: The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing events, like the COVID pandemic, and we'd like our models to know about them. Training a model from scratch is tedious work, but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.
Current: This notebook demonstrates [Generative Pseudo Labeling (GPL)](https://arxiv.org/abs/2112.07577), an efficient approach to adapt existing dense retrieval models to new domains & data.
Suggested: This notebook demonstrates [Generative Pseudo Labeling (GPL)](https://arxiv.org/abs/2112.07577), an efficient approach to adapt existing dense retrieval models to new domains and data.
Current: We get a collection 10k scientific papers on COVID-19 and then fine-tune within 15-60 minutes (depending on your GPU) to include the new COVID knowledge into our model.
Suggested: We get a collection of 10k scientific papers on COVID-19 and then fine-tune the model within 15-60 minutes (depending on your GPU) so that it includes the new COVID knowledge.
(Optional) Download Pre-Generated Questions or Generate Them Outside of Haystack
Current: The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on GPU used), we'll download question/document pairs directly from the HuggingFace hub.
Suggested: The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on the GPU used), we'll download question/document pairs directly from the Hugging Face hub.
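To make the mechanics behind this step concrete, here is a minimal sketch of what one GPL training example looks like after question generation, negative mining, and cross-encoder scoring. All strings and scores below are invented for illustration; in the tutorial, Haystack's PseudoLabelGenerator produces these automatically.

```python
# Hedged sketch of a single GPL training example. The query is
# generated from the positive passage, the negative is mined from the
# corpus, and the margin is the difference between the cross-encoder's
# scores for the two passages (the MarginMSE training target).
# All values below are invented for illustration.
def make_gpl_example(query, pos_doc, neg_doc, pos_score, neg_score):
    return {
        "query": query,
        "pos_doc": pos_doc,
        "neg_doc": neg_doc,
        "margin": pos_score - neg_score,
    }

example = make_gpl_example(
    query="how is covid-19 transmitted",
    pos_doc="COVID-19 spreads mainly through respiratory droplets.",
    neg_doc="Ebola spreads through direct contact with body fluids.",
    pos_score=9.2,  # invented cross-encoder score for the positive
    neg_score=1.7,  # invented cross-encoder score for the negative
)
```

The retriever is then fine-tuned to reproduce that score margin with its own dot products, which is how the cross-encoder's domain knowledge is distilled into the dense model.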
Current: We select 10k scientific publications (title + abstract) that are connected to COVID-19. As dataset we use [TREC-COVID-19](https://huggingface.co/datasets/nreimers/trec-covid).
Suggested: We select 10k scientific publications (title + abstract) that are connected to COVID-19. As a dataset, we use [TREC-COVID-19](https://huggingface.co/datasets/nreimers/trec-covid).
Current: """## Optionally download pre-generated questions or even generate them outside of Haystack
Suggested: """## (Optional) Download Pre-Generated Questions or Generate Them Outside of Haystack
Current: """## Verify that EmbeddingRetriever is adapted and save it for future use
Suggested: """## Verify That EmbeddingRetriever Is Adapted and Save It for Future Use
Just minor updates and it's good to go
Hi @agnieszka-m @vblagoje the tutorial doesn't show up on our docs website. It still needs to be added here: https://github.com/deepset-ai/haystack-website/blob/ab49b74cdee82153f56348372e1d3ec2d294d32a/docs/latest/menu.json#L144
@agnieszka-m I'll make the PR
* Add GPL adaptation tutorial
* Latest round of Aga's corrections
* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Proposed changes:
cc @TuanaCelik