A robust Retrieval-Augmented Generation (RAG) solution for processing PDF documents, generating well-cited answers, creating evidence documents, and annotating source materials.
- 📄 PDF Processing: Extract and process text from single or multiple PDF documents
- 🔍 Retrieval: Advanced context retrieval with semantic search capabilities
- 📊 Comprehensive Citations: Track and cite sources with relevance scores
- 📑 Evidence Generation: PDF evidence documents with citations
- ✏️ Source Annotation: Generate annotated PDFs that highlight cited text in the original documents
- 📈 Evaluation: Assess answer quality against ground truth with metrics
- Python 3.9+
- Google API key for Gemini models
- Clone the repository:

```bash
git clone https://dev.azure.com/INGCDaaS/IngOne/_git/P10926-genai-data-extraction
cd rag-pkg
```

- Create a virtual environment:

```bash
python -m venv rag-env
source rag-env/bin/activate  # On Windows: rag-env\Scripts\activate
```

- Install the package and dependencies:

```bash
pip install -e .
```

- Create a `.env` file in the project root with your Google API key:

```
GOOGLE_API_KEY=your_api_key_here
MODEL_NAME=gemini-1.5-pro
TEMPERATURE=0.2
```
Process PDFs and answer questions with citations:

```python
from rag_pkg.rag import process_pdfs, answer_questions
from rag_pkg.evidence import generate_evidence_pdf

# Process PDFs
processed_docs = process_pdfs(["path/to/document1.pdf", "path/to/document2.pdf"])

# Answer questions
questions = ["What are the key points in these documents?"]
answers, citations = answer_questions(
    processed_docs=processed_docs,
    questions=questions,
    model_name="gemini-1.5-pro",
    temperature=0.2,
)

# Generate evidence document
evidence_path, _ = generate_evidence_pdf(
    citations=citations[0],
    output_path="evidence.pdf",
    question=questions[0],
    answer=answers[0],
)
print(f"Evidence document generated at: {evidence_path}")
```

The package comes with several example scripts:
```bash
# Answer a question about a document
python src/rag_pkg/examples/simple_rag.py data/your-document.pdf -q "What is this document about?" -o output

# Evaluate answers against ground truth
python src/rag_pkg/examples/evaluation_example.py data/your-document.pdf -g sample_ground_truth.json -o eval_results

# Annotate source documents with cited text
python src/rag_pkg/examples/annotation_example.py data/your-document.pdf -q "What are the key points?" -o annotated_output
```

Here's an example of running the annotation functionality:
```bash
# Run the annotation example with a specific onboarding PDF and question
python src/rag_pkg/examples/annotation_example.py data/leading-edge-onboarding-best-practices-guide_1.pdf -q "What are best practices for onboarding?" -o test_annotations
```

This command:
- Processes the PDF file
- Answers the question
- Generates an evidence document with citations
- Creates annotated PDFs highlighting the cited text in the original document
- Saves all outputs to the "test_annotations" directory
You can also provide ground truth data to evaluate the quality of the generated answers:
```bash
# Run with ground truth evaluation
python src/rag_pkg/examples/annotation_example.py data/leading-edge-onboarding-best-practices-guide_1.pdf -q "What should I do in my first 90 days?" -o first_90_days_output -g src/rag_pkg/examples/first_90_days_ground_truth.json
```

This will additionally:
- Evaluate the generated answer against your provided ground truth
- Generate a LaTeX evaluation report with metrics and comparisons
- Save evaluation results as JSON and LaTeX files in the output directory
The ground truth file should be a JSON file with this structure:
```json
{
  "questions": [
    {
      "question": "What should I do in my first 90 days?",
      "answer": "In your first 90 days, focus on these key activities: (1) Build relationships with your supervisor through regular one-on-one meetings to discuss expectations and goals. (2) Connect with colleagues and stakeholders to understand the organizational structure and culture. (3) Complete all required onboarding training and administrative processes..."
    }
  ]
}
```

Below is a comprehensive table of all command-line arguments across the different example scripts:
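Before running an evaluation, it can help to sanity-check the ground truth file. The helper below is a small illustrative loader (not part of rag_pkg) that validates the structure shown above using only the standard library:

```python
import json

def load_ground_truth(path):
    """Load a ground truth file and check it matches the expected shape.

    Illustrative helper, not part of rag_pkg: it verifies that the JSON
    has a top-level "questions" list whose entries each carry non-empty
    "question" and "answer" strings.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    entries = data.get("questions")
    if not isinstance(entries, list) or not entries:
        raise ValueError('expected a non-empty top-level "questions" list')
    for i, entry in enumerate(entries):
        for key in ("question", "answer"):
            if not isinstance(entry.get(key), str) or not entry[key].strip():
                raise ValueError(f'entry {i} needs a non-empty "{key}" string')
    return entries
```

A malformed file then fails fast with a clear message instead of surfacing later as a confusing evaluation error.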
| Argument | Short | Scripts | Description | Default |
|---|---|---|---|---|
| `pdf_paths` | - | All | Path(s) to PDF file(s) to process | Required |
| `--question` | `-q` | All | Question to answer | Required (simple_rag, annotation); "What are the main topics?" (evaluation) |
| `--output-dir` | `-o` | All | Directory to save output files | `"output"` |
| `--model` | `-m` | All | Model to use for answering | From env or `"gemini-1.5-pro"` |
| `--temperature` | `-t` | All | Temperature for LLM generation | From env or 0.0-0.2 |
| `--debug` | - | All | Enable debug logging | False |
| `--ground-truth` | `-g` | Evaluation, Annotation | Path to ground truth JSON file for evaluation | None |
| `--metrics` | `-m` | Evaluation | Metrics to calculate (comma-separated) | `"similarity,factuality,completeness"` |
| `--report-format` | `-f` | Evaluation | Format for evaluation report | `"latex"` |
| `--report-path` | `-r` | Evaluation | Path for evaluation report | `"evaluation_report.{ext}"` |
The core module for document processing and question answering:
- `process_pdfs()`: Convert PDFs to processable text
- `answer_questions()`: Generate answers with citations based on processed documents
Creates professional PDF evidence documents with:
- Citations from source documents
- Relevance scores
- Page references
- Formatted question and answer
Creates annotated versions of the original PDFs with:
- Highlighted citation text
- Source page extraction
- Visual indicators of cited content
Measures the quality of generated answers against ground truth:
- Semantic similarity scoring
- Citation accuracy assessment
- Customizable evaluation metrics
- LaTeX report generation with detailed comparisons
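To make the similarity idea concrete, here is a deliberately simplified stand-in for the semantic similarity metric: cosine similarity over bag-of-words counts. The package scores over embeddings rather than raw word counts, so treat this purely as an illustration of the scoring shape (0.0 = unrelated, 1.0 = identical):

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity over word counts: 1.0 for identical texts,
    0.0 for texts sharing no words. A toy proxy for embedding-based
    semantic similarity, which also maps answer pairs onto [0, 1]."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Unlike this word-overlap proxy, an embedding-based metric also rewards paraphrases that use different wording for the same meaning.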
Configure the package using environment variables in a `.env` file:

```
GOOGLE_API_KEY=your_api_key_here
MODEL_NAME=gemini-1.5-pro
TEMPERATURE=0.2
LOG_LEVEL=INFO
```
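The sketch below shows how such a file can be read with the standard library alone; the package itself may rely on a library like python-dotenv, so the parsing rules here (skip blanks and `#` comments, never override variables already set) are an assumption:

```python
import os

def load_env(path=".env"):
    """Minimal .env reader (illustrative only). Parses KEY=VALUE lines,
    ignoring blank lines and # comments, and uses setdefault so values
    already present in the real environment win over the file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```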
Customize generation parameters:

- `model_name`: Select from available Gemini models
- `temperature`: Control output randomness (0.0-1.0)
- `top_p`: Nucleus sampling parameter
- `top_k`: Top-k sampling parameter
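For reference, the four knobs above might be bundled like this before the answering call. Only `model_name` and `temperature` appear in the quick-start example, so treating `top_p` and `top_k` as pass-through keyword arguments is an assumption; check the `answer_questions` signature in your version:

```python
# Hypothetical bundle of generation settings; top_p and top_k are
# assumed pass-through parameters, not confirmed by the quick-start API.
generation_config = {
    "model_name": "gemini-1.5-pro",
    "temperature": 0.2,  # low randomness: prefer grounded, repeatable answers
    "top_p": 0.95,       # nucleus sampling: sample from the top 95% probability mass
    "top_k": 40,         # restrict sampling to the 40 most likely tokens
}

# e.g. answer_questions(processed_docs=..., questions=..., **generation_config)
```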
The annotation module creates individual PDFs for each citation with the cited text highlighted:

```python
from rag_pkg.annotations import annotate_citations

annotated_citations = annotate_citations(
    citations=citations,
    output_dir="annotations",
    pdf_dir="path/to/pdfs",
)
```

Evaluate the quality of your RAG system's answers against expected outputs:
```python
from rag_pkg.evaluation import evaluate_answers, generate_latex_report

# Run evaluation
eval_results = evaluate_answers(
    predictions=[generated_answer],
    ground_truths=[expected_answer],
    questions=[question],
    output_json="evaluation_results.json",
)

# Generate LaTeX report
latex_path = generate_latex_report(
    evaluation_results=eval_results,
    output_path="evaluation_report.tex",
    title="RAG Evaluation Report",
)
```

The package includes comprehensive logging and telemetry:
```python
from rag_pkg.utils import configure_logging, configure_telemetry

configure_logging(level="DEBUG")
configure_telemetry(service_name="rag-application")
```

Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Commit your changes: `git commit -m 'Add feature'`
- Push to the branch: `git push origin feature-name`
- Submit a pull request
For questions or support, please contact a WBAA team member.