
Add GPL adaptation tutorial#2632

Merged
vblagoje merged 3 commits into deepset-ai:master from vblagoje:gpl_tutorial
Jun 26, 2022
Conversation

@vblagoje
Member

@vblagoje vblagoje commented Jun 3, 2022

Proposed changes:

  • Adds GPL tutorial

cc @TuanaCelik

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@vblagoje vblagoje requested a review from TuanaCelik June 3, 2022 14:31
@vblagoje vblagoje added the type:documentation Improvements on the docs label Jun 3, 2022
Contributor

@TuanaCelik TuanaCelik left a comment

Hey @vblagoje - This looks good. A few things I would do:

  • Some hand holding maybe. Some explanation on what Generative Pseudo Labeling is and then short descriptions of what a cell block is about to do just above it (you already have this for some of them)
  • I think we may have to add this to the headers here but I will check this.
  • I would maybe include the full name 'Generative Pseudo Labeling' in the title(s) too. If I understand the tutorial correctly something like this might fit wdyt?: "Generative Pseudo Labeling for Domain Adaptation of Dense Retrieval" or "Domain Adaptation of Dense Retrieval with GPL" (fair if you think it's too long)

@vblagoje
Member Author

vblagoje commented Jun 3, 2022

All good points @TuanaCelik. Will make the recommended changes.

Comment thread tutorials/Tutorial17_GPL.py Outdated
Contributor

How about:
The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing evens, like the COVID pandemics, and we'd like our models to know about them. Training a model from scratch is tedious work but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.

Member Author

Perfect

Comment thread tutorials/Tutorial17_GPL.py Outdated
Contributor

The example below shows you how to use GPL to fine-tune a model so that it can answer the query: "How is COVID-19 transmitted?".

Comment thread tutorials/Tutorial17_GPL.py Outdated
Contributor

We're using TAS-B: a DistilBERT model...
Both DistilBERT and MS MARCO were trained on data from 2018 and before, so they don't have any COVID-related information.

Comment thread tutorials/Tutorial17_GPL.py Outdated
Contributor

For this example, we're using just four documents. When you ask the model "How is COVID-19 transmitted?", here are the answers that you get (dot-score and document):

Comment thread tutorials/Tutorial17_GPL.py Outdated
Contributor

You can see that the correct document is only third, outranked by Ebola and HIV information. Let's see how we can make this better.
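The ranking behavior described above can be sketched in a few lines. The embeddings and document texts below are made up for illustration (a real retriever such as TAS-B produces high-dimensional vectors); the point is only that sorting by dot score can surface off-topic documents when the model lacks domain knowledge:

```python
# Toy illustration of dot-score ranking: the query embedding is compared
# against each document embedding, and documents are sorted by dot product.
# All vectors here are hypothetical, chosen so the COVID document lands third,
# mirroring the situation described in the review comment.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query = [0.9, 0.1, 0.3]  # stand-in embedding for "How is COVID-19 transmitted?"
docs = {
    "Ebola is transmitted via direct contact with body fluids...": [0.8, 0.2, 0.2],
    "HIV is transmitted through certain body fluids...": [0.7, 0.3, 0.4],
    "COVID-19 spreads mainly through respiratory droplets...": [0.6, 0.2, 0.5],
    "Crimean-Congo fever is carried by ticks...": [0.1, 0.9, 0.1],
}

# Sort documents by dot score, highest first.
ranked = sorted(docs.items(), key=lambda kv: dot(query, kv[1]), reverse=True)
for text, vec in ranked:
    print(round(dot(query, vec), 2), text[:40])
```

With these toy vectors, the Ebola and HIV documents outscore the COVID one, which is exactly the failure mode GPL fine-tuning is meant to fix.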

Comment thread tutorials/Tutorial17_GPL.py Outdated
Contributor

I think # before "run" is not needed here.

Member Author

Ok, sure.

Comment thread tutorials/Tutorial17_GPL.py Outdated
Contributor

missing full stop

Comment thread tutorials/Tutorial17_GPL.py Outdated
Contributor

..PseudoLabelGenerator, (comma)
... data. (full stop)

Comment thread tutorials/Tutorial17_GPL.py Outdated
Contributor

Verify that EmbeddingRetriever is adapted and save it for future use

Comment thread tutorials/Tutorial17_GPL.py Outdated
Contributor

Let's repeat our query to see if the Retriever learned about COVID and can now rank it as #1 among the answers.

@vblagoje
Member Author

@agnieszka-m would you please confirm your recommended changes are included? TY

@vblagoje
Member Author

Seems like it is good to go now @agnieszka-m

Comment thread docs/_src/tutorials/tutorials/18.md Outdated
Contributor

Suggested change
The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing evens, like the COVID pandemics, and we'd like our models to know about them. Training a model from scratch is tedious work but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.
The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing events, like the COVID pandemics, and we'd like our models to know about them. Training a model from scratch is tedious work but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.

Comment thread docs/_src/tutorials/tutorials/18.md Outdated
Contributor

Suggested change
This notebook demonstrates [Generative Pseudo Labeling (GPL)](https://arxiv.org/abs/2112.07577), an efficient approach to adapt existing dense retrieval models to new domains & data.
This notebook demonstrates [Generative Pseudo Labeling (GPL)](https://arxiv.org/abs/2112.07577), an efficient approach to adapt existing dense retrieval models to new domains and data.

Comment thread docs/_src/tutorials/tutorials/18.md Outdated
Contributor

Suggested change
We get a collection 10k scientific papers on COVID-19 and then fine-tune within 15-60 minutes (depending on your GPU) to include the new COVID knowledge into our model.
We get a collection of 10k scientific papers on COVID-19 and then fine-tune the model within 15-60 minutes (depending on your GPU) so that it includes the COVID knowledge.

Comment thread docs/_src/tutorials/tutorials/18.md Outdated
Contributor

(Optional) Download Pre-Generated Questions or Generate Them Outside of Haystack

Comment thread docs/_src/tutorials/tutorials/18.md Outdated
Contributor

Suggested change
The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on GPU used), we'll download question/document pairs directly from the HuggingFace hub.
The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on the GPU used), we'll download question/document pairs directly from the Hugging Face hub.
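The point made above — that a model with no COVID knowledge can still produce on-topic queries by copying words from the passage — can be caricatured with a toy keyword-based generator. This is not the actual question-generation model the tutorial uses; it is a hypothetical sketch showing why reusing the passage's own vocabulary yields sensible pseudo-queries:

```python
# Toy stand-in for GPL's query-generation step: build a "question" by copying
# salient words from the passage. A real setup uses a trained question-
# generation model; this only illustrates that copying input words lets even
# a pre-COVID model produce COVID-related queries for COVID passages.

STOPWORDS = {"the", "is", "a", "of", "and", "to", "in", "through", "mainly"}

def toy_query(passage, n_words=3):
    # Normalize tokens and drop common function words.
    words = [w.strip(".,").lower() for w in passage.split()]
    keywords = [w for w in words if w not in STOPWORDS]
    # Form a crude query from the first few content words.
    return "what about " + " ".join(keywords[:n_words]) + "?"

passage = "COVID-19 spreads mainly through respiratory droplets in the air."
print(toy_query(passage))  # a query built from the passage's own vocabulary
```

Because the generated query shares vocabulary with its source passage, the (query, passage) pair is a usable positive training example even before the model knows anything about the new domain.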

Comment thread tutorials/Tutorial18_GPL.py Outdated
Contributor

Suggested change
We get a collection 10k scientific papers on COVID-19 and then fine-tune within 15-60 minutes (depending on your GPU) to include the new COVID knowledge into our model.
We get a collection of 10k scientific papers on COVID-19 and then fine-tune the model within 15-60 minutes (depending on your GPU) so that it includes the new COVID knowledge.

Comment thread tutorials/Tutorial18_GPL.py Outdated
Contributor

Suggested change
We select 10k scientific publications (title + abstract) that are connected to COVID-19. As dataset we use [TREC-COVID-19](https://huggingface.co/datasets/nreimers/trec-covid).
We select 10k scientific publications (title + abstract) that are connected to COVID-19. As a dataset, we use [TREC-COVID-19](https://huggingface.co/datasets/nreimers/trec-covid).

Comment thread tutorials/Tutorial18_GPL.py Outdated
Contributor

Suggested change
"""## Optionally download pre-generated questions or even generate them outside of Haystack
"""## (Optional) Download Pre-Generated Questions or Generate Them Outside of Haystack

Comment thread tutorials/Tutorial18_GPL.py Outdated
Contributor

Suggested change
The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on GPU used), we'll download question/document pairs directly from the HuggingFace hub.
The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on the GPU used), we'll download question/document pairs directly from the Hugging Face hub.

Comment thread tutorials/Tutorial18_GPL.py Outdated
Contributor

Suggested change
"""## Verify that EmbeddingRetriever is adapted and save it for future use
"""## Verify That EmbeddingRetriever Is Adapted and Save It for Future Use

@agnieszka-m
Contributor

Just minor updates and it's good to go

@vblagoje vblagoje merged commit b08c5f8 into deepset-ai:master Jun 26, 2022
@julian-risch
Member

Hi @agnieszka-m @vblagoje the tutorial doesn't show up on our docs website. It still needs to be added here: https://github.com/deepset-ai/haystack-website/blob/ab49b74cdee82153f56348372e1d3ec2d294d32a/docs/latest/menu.json#L144

@vblagoje
Member Author

@agnieszka-m I'll make the PR

Krak91 pushed a commit to Krak91/haystack that referenced this pull request Jul 26, 2022
* Add GPL adaptation tutorial

* Latest round of Aga's corrections

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@vblagoje vblagoje deleted the gpl_tutorial branch February 28, 2023 12:08
