# RAG-Framework

A robust Retrieval-Augmented Generation (RAG) solution for processing PDF documents, generating well-cited answers, creating evidence documents, and annotating source materials.

## Features

- 📄 PDF Processing: Extract and process text from single or multiple PDF documents
- 🔍 Retrieval: Advanced context retrieval with semantic search capabilities
- 📊 Comprehensive Citations: Track and cite sources with relevance scores
- 📑 Evidence Generation: Create PDF evidence documents with citations
- ✏️ Source Annotation: Generate annotated PDFs that highlight cited text in the original documents
- 📈 Evaluation: Assess answer quality against ground truth with metrics

## Installation

### Prerequisites

- Python 3.9+
- A Google API key for Gemini models

### Setup

1. Clone the repository:

   ```shell
   git clone https://dev.azure.com/INGCDaaS/IngOne/_git/P10926-genai-data-extraction
   cd rag-pkg
   ```

2. Create a virtual environment:

   ```shell
   python -m venv rag-env
   source rag-env/bin/activate  # On Windows: rag-env\Scripts\activate
   ```

3. Install the package and dependencies:

   ```shell
   pip install -e .
   ```

4. Create a `.env` file in the project root with your Google API key:

   ```
   GOOGLE_API_KEY=your_api_key_here
   MODEL_NAME=gemini-1.5-pro
   TEMPERATURE=0.2
   ```

## Usage

### Basic RAG Example

Process PDFs and answer questions with citations:

```python
from rag_pkg.rag import process_pdfs, answer_questions
from rag_pkg.evidence import generate_evidence_pdf

# Process PDFs
processed_docs = process_pdfs(["path/to/document1.pdf", "path/to/document2.pdf"])

# Answer questions
questions = ["What are the key points in these documents?"]
answers, citations = answer_questions(
    processed_docs=processed_docs,
    questions=questions,
    model_name="gemini-1.5-pro",
    temperature=0.2
)

# Generate evidence document
evidence_path, _ = generate_evidence_pdf(
    citations=citations[0],
    output_path="evidence.pdf",
    question=questions[0],
    answer=answers[0]
)

print(f"Evidence document generated at: {evidence_path}")
```

### Command-line Examples

The package comes with several example scripts:

#### Simple RAG Example

```shell
python src/rag_pkg/examples/simple_rag.py data/your-document.pdf -q "What is this document about?" -o output
```

#### Evaluation Example

```shell
python src/rag_pkg/examples/evaluation_example.py data/your-document.pdf -g sample_ground_truth.json -o eval_results
```

#### Annotation Example

```shell
python src/rag_pkg/examples/annotation_example.py data/your-document.pdf -q "What are the key points?" -o annotated_output
```

## Examples

### Basic Annotation

Here's an example of running the annotation functionality:

```shell
# Run the annotation example with a specific onboarding PDF and question
python src/rag_pkg/examples/annotation_example.py data/leading-edge-onboarding-best-practices-guide_1.pdf -q "What are best practices for onboarding?" -o test_annotations
```

This command:

1. Processes the PDF file
2. Answers the question
3. Generates an evidence document with citations
4. Creates annotated PDFs highlighting the cited text in the original document
5. Saves all outputs to the "test_annotations" directory

### With Ground Truth Evaluation

You can also provide ground truth data to evaluate the quality of the generated answers:

```shell
# Run with ground truth evaluation
python src/rag_pkg/examples/annotation_example.py data/leading-edge-onboarding-best-practices-guide_1.pdf -q "What should I do in my first 90 days?" -o first_90_days_output -g src/rag_pkg/examples/first_90_days_ground_truth.json
```

This will additionally:

1. Evaluate the generated answer against your provided ground truth
2. Generate a LaTeX evaluation report with metrics and comparisons
3. Save evaluation results as JSON and LaTeX files in the output directory

The ground truth file should be a JSON file with this structure:

```json
{
  "questions": [
    {
      "question": "What should I do in my first 90 days?",
      "answer": "In your first 90 days, focus on these key activities: (1) Build relationships with your supervisor through regular one-on-one meetings to discuss expectations and goals. (2) Connect with colleagues and stakeholders to understand the organizational structure and culture. (3) Complete all required onboarding training and administrative processes..."
    }
  ]
}
```
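Because a malformed ground-truth file would otherwise surface as a confusing error deep inside evaluation, it can help to validate the structure above before running. The following is a minimal sketch using only the standard library; the helper name `load_ground_truth` is our own, not part of the package API:

```python
import json

def load_ground_truth(path):
    """Load a ground-truth JSON file and check it has the expected shape:
    a top-level "questions" list whose entries each carry non-empty
    "question" and "answer" strings."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    entries = data.get("questions")
    if not isinstance(entries, list) or not entries:
        raise ValueError("ground truth must contain a non-empty 'questions' list")
    for i, entry in enumerate(entries):
        for key in ("question", "answer"):
            if not isinstance(entry.get(key), str) or not entry[key].strip():
                raise ValueError(f"entry {i} is missing a non-empty '{key}' field")
    return entries
```

Running this check first turns a silent mismatch into an immediate, descriptive error.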

## Command-Line Arguments

Below is a comprehensive table of all command-line arguments across the different example scripts:

| Argument | Short | Scripts | Description | Default |
|----------|-------|---------|-------------|---------|
| `pdf_paths` | - | All | Path(s) to PDF file(s) to process | Required |
| `--question` | `-q` | All | Question to answer | Required (simple_rag, annotation); "What are the main topics?" (evaluation) |
| `--output-dir` | `-o` | All | Directory to save output files | `"output"` |
| `--model` | `-m` | All | Model to use for answering | From env or `"gemini-1.5-pro"` |
| `--temperature` | `-t` | All | Temperature for LLM generation | From env or 0.0-0.2 |
| `--debug` | - | All | Enable debug logging | `False` |
| `--ground-truth` | `-g` | Evaluation, Annotation | Path to ground truth JSON file for evaluation | `None` |
| `--metrics` | `-m` | Evaluation | Metrics to calculate (comma-separated) | `"similarity,factuality,completeness"` |
| `--report-format` | `-f` | Evaluation | Format for evaluation report | `"latex"` |
| `--report-path` | `-r` | Evaluation | Path for evaluation report | `"evaluation_report.{ext}"` |
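For illustration, here is how the shared arguments from the table could be wired up with `argparse`. This is a sketch of the pattern, not the scripts' actual parser, and the defaults shown are assumptions taken from the table rather than read from the source:

```python
import argparse

def build_parser():
    """Build an argument parser mirroring the shared options in the table."""
    parser = argparse.ArgumentParser(description="RAG example script")
    parser.add_argument("pdf_paths", nargs="+",
                        help="Path(s) to PDF file(s) to process")
    parser.add_argument("--question", "-q", required=True,
                        help="Question to answer")
    parser.add_argument("--output-dir", "-o", default="output",
                        help="Directory to save output files")
    parser.add_argument("--model", "-m", default="gemini-1.5-pro",
                        help="Model to use for answering")
    parser.add_argument("--temperature", "-t", type=float, default=0.2,
                        help="Temperature for LLM generation")
    parser.add_argument("--debug", action="store_true",
                        help="Enable debug logging")
    parser.add_argument("--ground-truth", "-g", default=None,
                        help="Path to ground truth JSON file for evaluation")
    return parser
```

With this parser, `build_parser().parse_args(["doc.pdf", "-q", "What?"])` yields an object whose attributes match the column names above (`pdf_paths`, `question`, `output_dir`, and so on).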

## Modules

### RAG Module

The core module for document processing and question answering:

- `process_pdfs()`: Convert PDFs to processable text
- `answer_questions()`: Generate answers with citations based on processed documents

### Evidence Module

Creates professional PDF evidence documents with:

- Citations from source documents
- Relevance scores
- Page references
- Formatted question and answer

### Annotation Module

Creates annotated versions of the original PDFs with:

- Highlighted citation text
- Source page extraction
- Visual indicators of cited content

### Evaluation Module

Measures the quality of generated answers against ground truth:

- Semantic similarity scoring
- Citation accuracy assessment
- Customizable evaluation metrics
- LaTeX report generation with detailed comparisons

## Configuration

### Environment Variables

Configure the package using environment variables in a `.env` file:

```
GOOGLE_API_KEY=your_api_key_here
MODEL_NAME=gemini-1.5-pro
TEMPERATURE=0.2
LOG_LEVEL=INFO
```
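To make the `.env` mechanism concrete, here is a minimal standard-library sketch of how such a file can be parsed into a dictionary. It is a simplified stand-in for a dedicated loader such as python-dotenv (which the package may or may not use internally), handling only plain `KEY=VALUE` lines:

```python
def load_env(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict.

    Skips blank lines, '#' comments, and lines without '='. Values are
    returned as strings; callers convert types (e.g. float(TEMPERATURE)).
    """
    values = {}
    try:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                values[key.strip()] = value.strip()
    except FileNotFoundError:
        pass  # No .env file is fine; fall back to real environment variables
    return values
```

Note that everything comes back as a string, so `TEMPERATURE=0.2` needs an explicit `float()` before being passed to a model.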

### Model Parameters

Customize generation parameters:

- `model_name`: Select from available Gemini models
- `temperature`: Control output randomness (0.0-1.0)
- `top_p`: Nucleus sampling parameter
- `top_k`: Top-k sampling parameter
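Since out-of-range values for these parameters typically only fail at request time, it can be useful to bundle and validate them up front. The sketch below is illustrative; `build_generation_config` is our own helper, not a function exported by the package:

```python
def build_generation_config(model_name="gemini-1.5-pro", temperature=0.2,
                            top_p=0.95, top_k=40):
    """Bundle generation parameters, enforcing the documented ranges.

    temperature must lie in [0.0, 1.0], top_p in (0.0, 1.0], and top_k
    must be a positive integer. Defaults here are illustrative choices.
    """
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be in [0.0, 1.0]")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0.0, 1.0]")
    if not isinstance(top_k, int) or top_k < 1:
        raise ValueError("top_k must be a positive integer")
    return {"model_name": model_name, "temperature": temperature,
            "top_p": top_p, "top_k": top_k}
```

The resulting dict can then be unpacked into whichever call accepts these parameters, keeping the range checks in one place.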

## Advanced Features

### PDF Annotation

The annotation module creates individual PDFs for each citation with the cited text highlighted:

```python
from rag_pkg.annotations import annotate_citations

annotated_citations = annotate_citations(
    citations=citations,
    output_dir="annotations",
    pdf_dir="path/to/pdfs"
)
```

### Evaluation with Ground Truth

Evaluate the quality of your RAG system's answers against expected outputs:

```python
from rag_pkg.evaluation import evaluate_answers, generate_latex_report

# Run evaluation
eval_results = evaluate_answers(
    predictions=[generated_answer],
    ground_truths=[expected_answer],
    questions=[question],
    output_json="evaluation_results.json"
)

# Generate LaTeX report
latex_path = generate_latex_report(
    evaluation_results=eval_results,
    output_path="evaluation_report.tex",
    title="RAG Evaluation Report"
)
```

### Telemetry and Logging

The package includes comprehensive logging and telemetry:

```python
from rag_pkg.utils import configure_logging, configure_telemetry

configure_logging(level="DEBUG")
configure_telemetry(service_name="rag-application")
```
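The internals of `configure_logging` are not shown in this README; as a rough illustration of what such a helper typically does, here is a minimal stand-in built on the standard `logging` module. This is an assumption about the shape of the helper, not the package's actual implementation:

```python
import logging

def configure_logging(level="INFO", name="rag_pkg"):
    """Illustrative stand-in: attach a console handler at the given level.

    Guards against adding duplicate handlers on repeated calls, which would
    otherwise print each log record more than once.
    """
    logger = logging.getLogger(name)
    logger.setLevel(getattr(logging, level.upper(), logging.INFO))
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

The duplicate-handler guard matters in example scripts, where setup code can run more than once per process.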

## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Commit your changes: `git commit -m 'Add feature'`
4. Push to the branch: `git push origin feature-name`
5. Submit a pull request

## Contact

For questions or support, please contact a WBAA team member.
