Add GPL adaptation tutorial #2632
vblagoje merged 3 commits into deepset-ai:master from vblagoje:gpl_tutorial
Conversation
TuanaCelik left a comment
Hey @vblagoje - This looks good. A few things I would do:
- Some hand-holding maybe: an explanation of what Generative Pseudo Labeling is, and short descriptions of what each cell block is about to do just above it (you already have this for some of them)
- I think we may have to add this to the headers here but I will check this.
- I would maybe include the full name 'Generative Pseudo Labeling' in the title(s) too. If I understand the tutorial correctly something like this might fit wdyt?: "Generative Pseudo Labeling for Domain Adaptation of Dense Retrieval" or "Domain Adaptation of Dense Retrieval with GPL" (fair if you think it's too long)
All good points @TuanaCelik. Will make the recommended changes.
How about:
The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing events, like the COVID pandemic, and we'd like our models to know about them. Training a model from scratch is tedious work, but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.
The example below shows you how to use GPL to fine-tune a model so that it can answer the query: "How is COVID-19 transmitted?".
We're using TAS-B: a DistilBERT model...
Both DistilBERT and MS MARCO were trained on data from 2018 and before, so they don't have any COVID-related information.
For this example, we're using just four documents. When you ask the model "How is COVID-19 transmitted?", here are the answers that you get (dot score and document):
You can see that the correct document is only third, outranked by Ebola and HIV information. Let's see how we can make this better.
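The ranking step described here can be sketched in a few lines. This is a toy illustration, not the tutorial's code: the real notebook embeds the query and documents with TAS-B, while the vectors below are invented so that the resulting order matches the one observed (Ebola and HIV outrank COVID before fine-tuning).

```python
# Toy dot-score ranking. The tutorial uses TAS-B embeddings; these
# 3-dimensional vectors are invented purely to illustrate the step.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query_emb = [0.1, 0.9, 0.2]
doc_embs = {
    "Ebola transmission": [0.20, 0.95, 0.10],
    "HIV transmission": [0.15, 0.90, 0.15],
    "COVID-19 transmission": [0.05, 0.80, 0.30],
}

# Dense retrievers rank documents by the dot product between the
# query embedding and each document embedding.
scores = {name: dot(query_emb, emb) for name, emb in doc_embs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
# Before adaptation, the COVID document comes last in this toy setup.
```

With these made-up vectors the COVID document scores lowest, mirroring the tutorial's observation that it is outranked before fine-tuning.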
I think the # before "run" is not needed here.
...PseudoLabelGenerator, (comma)
... data. (full stop)
Verify that EmbeddingRetriever is adapted and save it for future use
Let's repeat our query to see if the Retriever learned about COVID and can now rank it as #1 among the answers.
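That verification step can be mocked up the same way: re-rank the documents and confirm the COVID passage moved to the top. Everything below is invented toy data standing in for the original and GPL-adapted retriever embeddings; it only illustrates the expected effect, not Haystack's API.

```python
# Toy before/after rank check. The embeddings are invented: "after"
# simulates the GPL-adapted retriever, in which the COVID passage
# embedding has moved toward COVID-related queries.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_of(target, query, docs):
    """1-based rank of `target` when docs are sorted by dot score."""
    ordered = sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)
    return ordered.index(target) + 1

query = [0.1, 0.9, 0.2]
before = {
    "Ebola": [0.20, 0.95, 0.10],
    "HIV": [0.15, 0.90, 0.15],
    "COVID-19": [0.05, 0.80, 0.30],
}
# Simulated effect of fine-tuning: only the COVID embedding changes.
after = dict(before, **{"COVID-19": [0.12, 0.95, 0.35]})
```

In this sketch the COVID document ranks third before adaptation and first after, which is exactly the outcome the comment suggests checking for.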
@agnieszka-m would you please confirm your recommended changes are included? TY
Seems like it is good to go now @agnieszka-m
Current: The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing evens, like the COVID pandemics, and we'd like our models to know about them. Training a model from scratch is tedious work but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.
Suggested: The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing events, like the COVID pandemic, and we'd like our models to know about them. Training a model from scratch is tedious work, but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.
Current: This notebook demonstrates [Generative Pseudo Labeling (GPL)](https://arxiv.org/abs/2112.07577), an efficient approach to adapt existing dense retrieval models to new domains & data.
Suggested: This notebook demonstrates [Generative Pseudo Labeling (GPL)](https://arxiv.org/abs/2112.07577), an efficient approach to adapt existing dense retrieval models to new domains and data.
Current: We get a collection 10k scientific papers on COVID-19 and then fine-tune within 15-60 minutes (depending on your GPU) to include the new COVID knowledge into our model.
Suggested: We get a collection of 10k scientific papers on COVID-19 and then fine-tune the model within 15-60 minutes (depending on your GPU) so that it includes the new COVID knowledge.
(Optional) Download Pre-Generated Questions or Generate Them Outside of Haystack
Current: The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on GPU used), we'll download question/document pairs directly from the HuggingFace hub.
Suggested: The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on the GPU used), we'll download question/document pairs directly from the Hugging Face hub.
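To make the mechanics behind this step concrete, here is a minimal sketch of what one GPL training example looks like after question generation, negative mining, and cross-encoder scoring. All strings and scores below are invented for illustration; in the tutorial, Haystack's PseudoLabelGenerator produces these automatically.

```python
# Hedged sketch of a single GPL training example. The query is
# generated from the positive passage, the negative is mined from the
# corpus, and the margin is the difference between the cross-encoder's
# scores for the two passages (the MarginMSE training target).
# All values below are invented for illustration.
def make_gpl_example(query, pos_doc, neg_doc, pos_score, neg_score):
    return {
        "query": query,
        "pos_doc": pos_doc,
        "neg_doc": neg_doc,
        "margin": pos_score - neg_score,
    }

example = make_gpl_example(
    query="how is covid-19 transmitted",
    pos_doc="COVID-19 spreads mainly through respiratory droplets.",
    neg_doc="Ebola spreads through direct contact with body fluids.",
    pos_score=9.2,  # invented cross-encoder score for the positive
    neg_score=1.7,  # invented cross-encoder score for the negative
)
```

The retriever is then fine-tuned to reproduce that score margin with its own dot products, which is how the cross-encoder's domain knowledge is distilled into the dense model.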
Current: We select 10k scientific publications (title + abstract) that are connected to COVID-19. As dataset we use [TREC-COVID-19](https://huggingface.co/datasets/nreimers/trec-covid).
Suggested: We select 10k scientific publications (title + abstract) that are connected to COVID-19. As a dataset, we use [TREC-COVID-19](https://huggingface.co/datasets/nreimers/trec-covid).
Current: """## Optionally download pre-generated questions or even generate them outside of Haystack
Suggested: """## (Optional) Download Pre-Generated Questions or Generate Them Outside of Haystack
Current: """## Verify that EmbeddingRetriever is adapted and save it for future use
Suggested: """## Verify That EmbeddingRetriever Is Adapted and Save It for Future Use
Just minor updates and it's good to go
Hi @agnieszka-m @vblagoje the tutorial doesn't show up on our docs website. It still needs to be added here: https://github.com/deepset-ai/haystack-website/blob/ab49b74cdee82153f56348372e1d3ec2d294d32a/docs/latest/menu.json#L144
@agnieszka-m I'll make the PR
* Add GPL adaptation tutorial
* Latest round of Aga's corrections
* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Proposed changes:
cc @TuanaCelik