Skip to content

Fix rag#38585

Merged
ydshieh merged 3 commits intomainfrom
fix_rag
Jun 23, 2025
Merged

Fix rag#38585
ydshieh merged 3 commits intomainfrom
fix_rag

Conversation

@ydshieh
Copy link
Copy Markdown
Collaborator

@ydshieh ydshieh commented Jun 4, 2025

What does this PR do?

datasets introduce trust_remote_code at some point (probably 2024/09), but RAG's modeling code isn't handling this, and we get

ValueError: The repository for wiki_dpr contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/wiki_dpr.

This PR makes necessary changes for users could at least specify this argument in from_pretrained that would pass to the datasets call.

@ydshieh ydshieh changed the title Fix rag Fix RAG Jun 4, 2025
Comment on lines +454 to +460
trust_remote_code = kwargs.get("trust_remote_code", False)
index = cls._build_index(config, trust_remote_code=trust_remote_code)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass down to dataset call

split=self.dataset_split,
dummy=self.use_dummy_dataset,
revision=dataset_revision,
trust_remote_code=trust_remote_code,
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is load_dataset

Comment thread tests/models/rag/test_modeling_rag.py Outdated
tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
"facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True, dataset_revision="b24a417"
"facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True, dataset_revision="b24a417", trust_remote_code=True,
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to update other tests, but I would like to hear your opinion on the modeling code changes.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look fine to me!

@ydshieh ydshieh requested a review from LysandreJik June 4, 2025 14:48
@ydshieh ydshieh changed the title Fix RAG [don't merge yet] Fix RAG Jun 4, 2025
@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines -844 to -845
EXPECTED_OUTPUT_TEXT_1 = "\"She's My Kind of Girl"
EXPECTED_OUTPUT_TEXT_2 = "\"She's My Kind of Love"
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why we have this short expected outputs.

It's already not matching back to 2024/04 and maybe even earlier.

https://huggingface.slack.com/archives/C06LR9PQA00/p1712388818140089?thread_ts=1712388743.278929&cid=C06LR9PQA00

@ydshieh ydshieh changed the title [don't merge yet] Fix RAG Fix rag Jun 19, 2025
@ydshieh
Copy link
Copy Markdown
Collaborator Author

ydshieh commented Jun 19, 2025

@Cyrilvallez ready for a review.

For the changes in the tests, there is something for @itazap to check.

Comment thread tests/models/rag/test_modeling_rag.py Outdated
Comment on lines +933 to +935
# PR #31938 cause the output being changed from `june 22, 2018` to `june 22 , 2018`.
# Need @itazap to take a look
# TODO: itazap
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @itazap

Copy link
Copy Markdown
Member

@Cyrilvallez Cyrilvallez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allright, LGTM in general! Let's just wait for @itazap to confirm before merging!

@itazap
Copy link
Copy Markdown
Collaborator

itazap commented Jun 20, 2025

The additional 'space' after the comma in the test june 22 , 2018 likely should be expected (ie, not stripped), since rag uses BART. test_tokenization_rag is minimal so it isn't explicitly clear how to handle spaces around punctuation. Since #31938 removed clean_up_tokenization_spaces=True by default, and the bart/rag tests continued to pass, and so for me it's safe to conclude that clean_up_tokenization_spaces=False, and the space after the comma should be preserved -> as updated in this PR!

TLDR: LGTM :)

@ydshieh
Copy link
Copy Markdown
Collaborator Author

ydshieh commented Jun 20, 2025

Thank you @itazap for double checking 👍

Copy link
Copy Markdown
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm !

Comment thread tests/models/rag/test_modeling_rag.py Outdated
"wget",
"-O",
f"{cls.index_path}",
"https://huggingface.co/datasets/hf-internal-testing/wiki_dpr_dummy/resolve/main/index",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rename the file "index.faiss" ? and optionally include what kind of index it is, here I assume "flat_index.faiss" ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will rename and maybe give the steps of how i produce this dummy dataset for testing purpose.

@ydshieh ydshieh merged commit f9be71b into main Jun 23, 2025
15 checks passed
@ydshieh ydshieh deleted the fix_rag branch June 23, 2025 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants