Fix RAG finetuning + add finetuning test #8585
Conversation
Hi, I tried to execute finetune.py on two GPUs. It mainly fails with the following error, but when I run with a single GPU it works. I have also attached a screenshot.
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [130.216.209.142]:55728
What command did you run exactly?
Does changing the port with
It says,
I tried with
What's your pytorch-lightning version?
Version: 1.0.4
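For reference, a minimal sketch of forcing a different rendezvous port before launching distributed training. `MASTER_ADDR` and `MASTER_PORT` are the standard torch.distributed environment variables; the address and port values below are arbitrary picks for illustration, not something taken from this thread:

```python
import os

# Sketch: point torch.distributed at an explicit rendezvous address/port
# before the training script initializes its process group.
# "29501" is an arbitrary free port chosen for illustration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
```

Setting these before the script starts (or exporting them in the shell) changes which TCP port the gloo/nccl rendezvous uses, which can help when the default port is blocked or already in use.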
@lhoestq
Hi, just wanted to know: did you manage to run the finetune.sh script without any errors?
Ok I fixed the tensor issue and updated the readme. I also had to rename some of the example files of RAG to avoid collisions with the files of the seq2seq examples; the name collision broke the CI tests with failed imports. I did:
examples/rag/utils.py -> examples/rag/utils_rag.py
examples/rag/callbacks.py -> examples/rag/callbacks_rag.py
examples/rag/finetune.py -> examples/rag/finetune_rag.py
examples/rag/finetune.sh -> examples/rag/finetune_rag.sh
All tests are green now :)

Thanks a lot for your quick response.
      mode="max",
      save_top_k=3,
-     period=0,  # maybe save a checkpoint every time val is run, not just end of epoch.
+     period=1,  # maybe save a checkpoint every time val is run, not just end of epoch.
why go from 0 to 1?
Oh, I changed that to speed up the test and forgot to remove it. I can just modify the validation frequency instead.
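To illustrate what the `period` value controls, here is a rough, simplified model of an epoch-interval checkpoint gate. This is not pytorch-lightning's actual implementation, just a sketch of the logic being discussed:

```python
def should_checkpoint(epoch: int, last_saved_epoch: int, period: int) -> bool:
    """Loose sketch of a `period`-style checkpoint gate: allow a save only
    if at least `period` epochs have passed since the last saved one.
    With period=0 every validation run may save a checkpoint; with
    period=1 at most one checkpoint is saved per epoch."""
    return epoch - last_saved_epoch >= period

# period=0: a second save within the same epoch is still allowed
assert should_checkpoint(epoch=3, last_saved_epoch=3, period=0)
# period=1: a second save within the same epoch is skipped
assert not should_checkpoint(epoch=3, last_saved_epoch=3, period=1)
```

Under this model, bumping `period` from 0 to 1 reduces checkpointing frequency rather than changing what gets saved, which is why it showed up as an accidental speed-up tweak.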
patrickvonplaten
left a comment
Awesome test! Just two things I'd rather avoid:
- adding dummy data files to master
- forcing return_dict=True
I took your comment into account @patrickvonplaten
@lhoestq hello, thank you for this amazing feature. When I try to create my custom dataset I'm receiving this error:
I'm using Google Colab to test this: https://colab.research.google.com/drive/1Cjj18rYmeS0Bueis_KPB5Wbybl-JNDLL?usp=sharing

Well, I didn't install the specific dependencies you defined, excuse me. Solved by running !pip install -r /transformers/examples/rag/requirements.txt. At least it's recorded here in case someone has the same problem. haha


Following #7715 we need more test coverage of the RAG example scripts.
In this PR I'm adding a test for the finetuning script.
The test includes a single-GPU test and a multi-GPU test. Both are passing.
As mentioned in #7816 and #8345, there were some errors in the script that I had to fix.
Moreover, since @amogkam has also been working on the finetuning script to integrate Ray, I made sure to reduce possible conflicts with his PR #8583. More precisely, I'm reusing the CustomAccel class, which allows initializing either the PyTorch distributed retrieval or the Ray distributed retrieval.
Also fixed a bug in the RAG forward pass (see #8665).
Fix #7816
Fix #8345