PyTerrier-RAG is an extension for PyTerrier that makes it easier to produce retrieval augmented generation pipelines. PyTerrier-RAG supports:
- Easy access to common QA datasets
- Pre-built indices for common corpora
- Popular reader models, such as Fusion-in-Decoder and Llama
- Evaluation measures
PyTerrier-RAG also provides access to all of the retrievers (sparse, learned sparse and dense) and rerankers (from MonoT5 to RankGPT) available through the wider PyTerrier ecosystem.
Installation is as easy as `pip install pyterrier-rag`.
Try it out here on Google Colab now by clicking the "Open in Colab" button!
- Sparse Retrieval with FiD and FlanT5 readers: sparse_retrieval_FiD_FlanT5.ipynb
- SearchR1 with Sparse Retrieval and MonoT5: examples/search-r1.ipynb
Readers can be constructed using the `Reader` class with different `Backend` implementations:
```python
from pyterrier_rag.readers import Reader
from pyterrier_rag import Seq2SeqLMBackend, OpenAIBackend, VLLMBackend

flanT5 = Reader(Seq2SeqLMBackend("google/flan-t5-base"))
llama = Reader(OpenAIBackend("llama-3-8b-instruct", api_key="your_api_key", base_url="your_api_url"))
deepseek = Reader(VLLMBackend("deepseek-ai/DeepSeek-R1-Distill-Llama-8B"))
```

We also provide specialist reader classes, such as Fusion-in-Decoder (FiD): `pyterrier_rag.readers.T5FiD` and `pyterrier_rag.readers.BARTFiD`.
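Each `Reader` is a standard PyTerrier transformer, so it can be placed at the end of any retrieval pipeline. A minimal sketch, assuming the `bm25` retriever built from the pre-built index described below, and assuming the reader writes its answer into a `qanswer` column:

```python
# use the top 5 BM25 results as context for the FlanT5 reader
qa_pipeline = bm25 % 5 >> flanT5
answers = qa_pipeline.search("What are chemical reactions?")
print(answers.iloc[0]["qanswer"])  # assumed answer column name
```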
RAG pipelines can be formulated as easily as:

```python
import pyterrier as pt
import pyterrier_rag
from pyterrier_t5 import MonoT5

# a BM25 retriever over the pre-built Wikipedia index (see the pre-built indices described below)
bm25 = pt.Artifact.from_hf('pyterrier/ragwiki-terrier').bm25(include_fields=['docno', 'text', 'title'])
fid = pyterrier_rag.readers.T5FiD()
bm25_rag = bm25 % 10 >> fid
monoT5_rag = bm25 % 10 >> MonoT5() >> fid

monoT5_rag.search("What are chemical reactions?")
```

Try it out now with the example notebook: sparse_retrieval_FiD_FlanT5.ipynb.
These frameworks use search as a tool - the reasoning model decides when to search, and then integrates the retrieved results into the input for the next invocation of the model:
- Search-R1: `pyterrier_rag.SearchR1` (https://arxiv.org/pdf/2503.09516)
- Search-O1: `pyterrier_rag.SearchO1` (https://arxiv.org/abs/2501.05366)
- R1-Searcher: `pyterrier_rag.R1Searcher` (https://arxiv.org/abs/2503.05592)
For example, Search-R1 can wrap a BM25 + MonoT5 retrieval pipeline:

```python
import pyterrier as pt
import pyterrier_t5
import pyterrier_rag

bm25 = pt.Artifact.from_hf('pyterrier/ragwiki-terrier').bm25(include_fields=['docno', 'text', 'title'])
monoT5 = pyterrier_t5.MonoT5()
r1_monoT5 = pyterrier_rag.SearchR1(bm25 % 20 >> monoT5)
r1_monoT5.search("What are chemical reactions?")
```

Try these frameworks out now with our example notebook: examples/search-r1.ipynb.
Queries and gold answers of common datasets can be accessed through the PyTerrier datasets API, e.g. `pt.get_dataset("rag:nq").get_topics('dev')` and `pt.get_dataset("rag:nq").get_answers('dev')` (see the short example after the list below). The following QA datasets are available:
- Natural Questions: `"rag:nq"`
- HotpotQA: `"rag:hotpotqa"`
- TriviaQA: `"rag:triviaqa"`
- Musique: `"rag:musique"`
- WebQuestions: `"rag:web_questions"`
- WoW: `"rag:wow"`
- PopQA: `"rag:popqa"`
We also provide pre-built indices for some standard RAG corpora. For instance, a BM25 retriever for the Wikipedia corpus for NQ can be obtained from a pre-existing index automatically downloaded from HuggingFace:
```python
import pyterrier as pt

sparse_index = pt.Artifact.from_hf('pyterrier/ragwiki-terrier')
bm25 = pt.rewrite.tokenise() >> sparse_index.bm25(include_fields=['docno', 'text', 'title']) >> pt.rewrite.reset()
```
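The resulting `bm25` transformer can then be used like any PyTerrier retriever; a minimal sketch:

```python
# retrieve the top BM25 results for a single question;
# the loaded 'title' and 'text' fields are included in the results
results = bm25.search("What are chemical reactions?")
print(results[['docno', 'title', 'score']].head())
```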
Dense indices are also provided, e.g. E5 on Wikipedia:

```python
import pyterrier_dr

e5 = pyterrier_dr.E5() >> pt.Artifact.from_hf("pyterrier/ragwiki-e5.flex") >> sparse_index.text_loader(['docno', 'title', 'text'])
```

An experiment comparing multiple RAG pipelines can be expressed using PyTerrier's `pt.Experiment()` API:
```python
pt.Experiment(
    [pipe1, pipe2],
    dataset.get_topics(),
    dataset.get_answers(),
    [pyterrier_rag.measures.EM, pyterrier_rag.measures.F1]
)
```

Available measures include:
- Answer length: `pyterrier_rag.measures.AnswerLen`
- Answers of zero length: `pyterrier_rag.measures.AnswerZeroLen`
- Exact match percentage: `pyterrier_rag.measures.EM`
- F1: `pyterrier_rag.measures.F1`
- BERTScore (measures similarity of the answer with relevant documents): `pyterrier_rag.measures.BERTScore`
- ROUGE, e.g. `pyterrier_rag.measures.ROUGE1F`
Use the `baseline` kwarg to conduct significance testing in your experiment, as sketched below; see the `pt.Experiment()` documentation for more examples.
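For example, a minimal sketch that treats the first pipeline as the baseline for paired significance tests (`pipe1` and `pipe2` as above):

```python
pt.Experiment(
    [pipe1, pipe2],
    dataset.get_topics(),
    dataset.get_answers(),
    [pyterrier_rag.measures.EM, pyterrier_rag.measures.F1],
    baseline=0  # index of the baseline pipeline; adds significance-test columns to the output
)
```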
If you use PyTerrier-RAG for your research, please cite our work:
Constructing and Evaluating Declarative RAG Pipelines in PyTerrier. Craig Macdonald, Jinyuan Fang, Andrew Parry and Zaiqiao Meng. In Proceedings of SIGIR 2025. https://arxiv.org/abs/2506.10802
- Craig Macdonald, University of Glasgow
- Jinyuan Fang, University of Glasgow
- Andrew Parry, University of Glasgow
- Zhili Shen, University of Glasgow
- Yulin Qiao, University of Glasgow
- Jie Zhan, University of Glasgow
- Zaiqiao Meng, University of Glasgow
- Sean MacAvaney, University of Glasgow