A robust Retrieval-Augmented Generation (RAG) solution for processing PDF documents, generating well-cited answers, creating evidence documents, and annotating source materials.
- 📄 PDF Processing: Extract and process text from single or multiple PDF documents
- 🔍 Retrieval: Advanced context retrieval with semantic search capabilities
- 📊 Comprehensive Citations: Track and cite sources with relevance scores
- 📑 Evidence Generation: PDF evidence documents with citations
- ✏️ Source Annotation: Generate annotated PDFs that highlight cited text in the original documents
- 📈 Evaluation: Assess answer quality against ground truth with metrics
- Python 3.9+
- Google API key for Gemini models
- Clone the repository:

```bash
git clone https://dev.azure.com/INGCDaaS/IngOne/_git/P10926-genai-data-extraction
cd rag-pkg
```

- Create a virtual environment:

```bash
python -m venv rag-env
source rag-env/bin/activate  # On Windows: rag-env\Scripts\activate
```

- Install the package and dependencies:

```bash
pip install -e .
```

- Create a `.env` file in the project root with your Google API key:

```
GOOGLE_API_KEY=your_api_key_here
MODEL_NAME=gemini-1.5-pro
TEMPERATURE=0.2
```
Process PDFs and answer questions with citations:

```python
from rag_pkg.rag import process_pdfs, answer_questions
from rag_pkg.evidence import generate_evidence_pdf

# Process PDFs
processed_docs = process_pdfs(["path/to/document1.pdf", "path/to/document2.pdf"])

# Answer questions
questions = ["What are the key points in these documents?"]
answers, citations = answer_questions(
    processed_docs=processed_docs,
    questions=questions,
    model_name="gemini-1.5-pro",
    temperature=0.2,
)

# Generate evidence document
evidence_path, _ = generate_evidence_pdf(
    citations=citations[0],
    output_path="evidence.pdf",
    question=questions[0],
    answer=answers[0],
)
print(f"Evidence document generated at: {evidence_path}")
```

The package comes with several example scripts:
```bash
# Answer a question about a document
python src/rag_pkg/examples/simple_rag.py data/your-document.pdf -q "What is this document about?" -o output

# Evaluate answers against ground truth
python src/rag_pkg/examples/evaluation_example.py data/your-document.pdf -g sample_ground_truth.json -o eval_results

# Annotate source documents with cited text
python src/rag_pkg/examples/annotation_example.py data/your-document.pdf -q "What are the key points?" -o annotated_output
```

Here's an example of running the annotation functionality:
```bash
# Run the annotation example with a specific onboarding PDF and question
python src/rag_pkg/examples/annotation_example.py data/leading-edge-onboarding-best-practices-guide_1.pdf -q "What are best practices for onboarding?" -o test_annotations
```

This command:
- Processes the PDF file
- Answers the question
- Generates an evidence document with citations
- Creates annotated PDFs highlighting the cited text in the original document
- Saves all outputs to the "test_annotations" directory
You can also provide ground truth data to evaluate the quality of the generated answers:
```bash
# Run with ground truth evaluation
python src/rag_pkg/examples/annotation_example.py data/leading-edge-onboarding-best-practices-guide_1.pdf -q "What should I do in my first 90 days?" -o first_90_days_output -g src/rag_pkg/examples/first_90_days_ground_truth.json
```

This will additionally:
- Evaluate the generated answer against your provided ground truth
- Generate a LaTeX evaluation report with metrics and comparisons
- Save evaluation results as JSON and LaTeX files in the output directory
The ground truth file should be a JSON file with this structure:
```json
{
  "questions": [
    {
      "question": "What should I do in my first 90 days?",
      "answer": "In your first 90 days, focus on these key activities: (1) Build relationships with your supervisor through regular one-on-one meetings to discuss expectations and goals. (2) Connect with colleagues and stakeholders to understand the organizational structure and culture. (3) Complete all required onboarding training and administrative processes..."
    }
  ]
}
```

Below is a comprehensive table of all command-line arguments across the different example scripts:
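Before running an evaluation, it can help to sanity-check the ground truth file. The helper below is a small illustrative loader (not part of rag_pkg) that validates the structure shown above using only the standard library:

```python
import json

def load_ground_truth(path):
    """Load a ground truth file and check it matches the expected shape.

    Illustrative helper, not part of rag_pkg: it verifies that the JSON
    has a top-level "questions" list whose entries each carry non-empty
    "question" and "answer" strings.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    entries = data.get("questions")
    if not isinstance(entries, list) or not entries:
        raise ValueError('expected a non-empty top-level "questions" list')
    for i, entry in enumerate(entries):
        for key in ("question", "answer"):
            if not isinstance(entry.get(key), str) or not entry[key].strip():
                raise ValueError(f'entry {i} needs a non-empty "{key}" string')
    return entries
```

A malformed file then fails fast with a clear message instead of surfacing later as a confusing evaluation error.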
| Argument | Short | Scripts | Description | Default |
|---|---|---|---|---|
| `pdf_paths` | - | All | Path(s) to PDF file(s) to process | Required |
| `--question` | `-q` | All | Question to answer | Required (simple_rag, annotation); "What are the main topics?" (evaluation) |
| `--output-dir` | `-o` | All | Directory to save output files | `"output"` |
| `--model` | `-m` | All | Model to use for answering | From env or `"gemini-1.5-pro"` |
| `--temperature` | `-t` | All | Temperature for LLM generation | From env or 0.0-0.2 |
| `--debug` | - | All | Enable debug logging | False |
| `--ground-truth` | `-g` | Evaluation, Annotation | Path to ground truth JSON file for evaluation | None |
| `--metrics` | `-m` | Evaluation | Metrics to calculate (comma-separated) | `"similarity,factuality,completeness"` |
| `--report-format` | `-f` | Evaluation | Format for evaluation report | `"latex"` |
| `--report-path` | `-r` | Evaluation | Path for evaluation report | `"evaluation_report.{ext}"` |
The core module for document processing and question answering:
- `process_pdfs()`: Convert PDFs to processable text
- `answer_questions()`: Generate answers with citations based on processed documents
Creates professional PDF evidence documents with:
- Citations from source documents
- Relevance scores
- Page references
- Formatted question and answer
Creates annotated versions of the original PDFs with:
- Highlighted citation text
- Source page extraction
- Visual indicators of cited content
Measures the quality of generated answers against ground truth:
- Semantic similarity scoring
- Citation accuracy assessment
- Customizable evaluation metrics
- LaTeX report generation with detailed comparisons
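To make the similarity idea concrete, here is a deliberately simplified stand-in for the semantic similarity metric: cosine similarity over bag-of-words counts. The package scores over embeddings rather than raw word counts, so treat this purely as an illustration of the scoring shape (0.0 = unrelated, 1.0 = identical):

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity over word counts: 1.0 for identical texts,
    0.0 for texts sharing no words. A toy proxy for embedding-based
    semantic similarity, which also maps answer pairs onto [0, 1]."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Unlike this word-overlap proxy, an embedding-based metric also rewards paraphrases that use different wording for the same meaning.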
Configure the package using environment variables in a `.env` file:

```
GOOGLE_API_KEY=your_api_key_here
MODEL_NAME=gemini-1.5-pro
TEMPERATURE=0.2
LOG_LEVEL=INFO
```
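The sketch below shows how such a file can be read with the standard library alone; the package itself may rely on a library like python-dotenv, so the parsing rules here (skip blanks and `#` comments, never override variables already set) are an assumption:

```python
import os

def load_env(path=".env"):
    """Minimal .env reader (illustrative only). Parses KEY=VALUE lines,
    ignoring blank lines and # comments, and uses setdefault so values
    already present in the real environment win over the file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```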
Customize generation parameters:

- `model_name`: Select from available Gemini models
- `temperature`: Control output randomness (0.0-1.0)
- `top_p`: Nucleus sampling parameter
- `top_k`: Top-k sampling parameter
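For reference, the four knobs above might be bundled like this before the answering call. Only `model_name` and `temperature` appear in the quick-start example, so treating `top_p` and `top_k` as pass-through keyword arguments is an assumption; check the `answer_questions` signature in your version:

```python
# Hypothetical bundle of generation settings; top_p and top_k are
# assumed pass-through parameters, not confirmed by the quick-start API.
generation_config = {
    "model_name": "gemini-1.5-pro",
    "temperature": 0.2,  # low randomness: prefer grounded, repeatable answers
    "top_p": 0.95,       # nucleus sampling: sample from the top 95% probability mass
    "top_k": 40,         # restrict sampling to the 40 most likely tokens
}

# e.g. answer_questions(processed_docs=..., questions=..., **generation_config)
```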
The annotation module creates individual PDFs for each citation with the cited text highlighted:

```python
from rag_pkg.annotations import annotate_citations

annotated_citations = annotate_citations(
    citations=citations,
    output_dir="annotations",
    pdf_dir="path/to/pdfs",
)
```

Evaluate the quality of your RAG system's answers against expected outputs:
```python
from rag_pkg.evaluation import evaluate_answers, generate_latex_report

# Run evaluation
eval_results = evaluate_answers(
    predictions=[generated_answer],
    ground_truths=[expected_answer],
    questions=[question],
    output_json="evaluation_results.json",
)

# Generate LaTeX report
latex_path = generate_latex_report(
    evaluation_results=eval_results,
    output_path="evaluation_report.tex",
    title="RAG Evaluation Report",
)
```

The package includes comprehensive logging and telemetry:
```python
from rag_pkg.utils import configure_logging, configure_telemetry

configure_logging(level="DEBUG")
configure_telemetry(service_name="rag-application")
```

Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Commit your changes: `git commit -m 'Add feature'`
- Push to the branch: `git push origin feature-name`
- Submit a pull request
For questions or support, please contact a WBAA team member.