A scalable Retrieval-Augmented Generation (RAG) system that can index and query multiple documents with memory-efficient processing.
- 📄 Index multiple PDFs with memory-efficient processing
- 🔎 Vector database integration (Weaviate) for semantic search
- 🧠 Advanced document chunking with configurable sizes and overlap
- 🤖 Google's Generative AI (Gemini) integration
- 🚀 Memory-efficient processing designed for large documents
- 🛠️ Configurable page range processing for partial document indexing
- 📊 Detailed logging and progress tracking
- Python 3.10+
- Docker and Docker Compose (for Weaviate vector database)
- Conda (recommended for environment management)
# Create conda environment
conda create -n arag_env python=3.10
conda activate arag_env
# Install dependencies
pip install -r requirements.txt
# Install the local package in development mode
pip install -e .
# Start Weaviate using Docker Compose
docker-compose up -d weaviate
Create or edit a .env file in the root directory with the following:
# Google AI API Key
GOOGLE_API_KEY=your_google_api_key_here
# Vector Database Configuration
VECTOR_DB_BASE_URL=http://localhost:8081
VECTOR_DB_TYPE=weaviate
WEAVIATE_CLASS_NAME=Document
# Logging Configuration
LOG_LEVEL=INFO
You can obtain a Google AI API key from Google AI Studio.
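Before indexing, it can help to confirm that the .env file contains every setting listed above. The following is a minimal sketch using only the standard library; `parse_env`, `missing_settings`, and the `REQUIRED` tuple are hypothetical helpers written for illustration, not part of this project, and the project itself may load its configuration differently.

```python
from pathlib import Path

# Settings from the .env example above that have no obvious default.
# Treating these four as required is an assumption made for this sketch.
REQUIRED = (
    "GOOGLE_API_KEY",
    "VECTOR_DB_BASE_URL",
    "VECTOR_DB_TYPE",
    "WEAVIATE_CLASS_NAME",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue  # ignore lines without an "=" sign
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def missing_settings(settings: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED if not settings.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        missing = missing_settings(parse_env(env_path.read_text(encoding="utf-8")))
        print("Missing settings:", ", ".join(missing) if missing else "none")
    else:
        print("No .env file found in the current directory")
```

Run it from the project root so that the relative `.env` path resolves.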
For large documents, index with memory-efficient settings and specific page ranges:
# For PowerShell
conda activate arag_env; python index_pdfs.py --path "data/your_document.pdf" --chunk-size 150 --overlap 25 --batch-size 2 --start-page 1 --end-page 20
To index every PDF in a directory, pass the directory path:
# For PowerShell
conda activate arag_env; python index_pdfs.py --path "data/" --chunk-size 150 --overlap 25 --batch-size 2
To query the indexed documents:
# For PowerShell
conda activate arag_env; python run.py "Your question about the indexed documents?"
If you encounter Out of Memory (OOM) errors:
- Reduce chunk-size (e.g., 150 chars)
- Reduce batch-size (e.g., 2)
- Process smaller page ranges with --start-page and --end-page
- Increase system swap space
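One way to apply the page-range tip is to split a large PDF into several smaller indexing runs. The sketch below only generates the command lines; it does not run them. It assumes that page numbers are 1-based and that --end-page is inclusive, which matches the examples in this README but is not confirmed by it. `page_ranges` and `indexing_commands` are hypothetical helpers, and the total page count has to be supplied by you.

```python
import shlex


def page_ranges(total_pages: int, pages_per_run: int) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into consecutive inclusive (start, end) ranges."""
    if total_pages < 1 or pages_per_run < 1:
        raise ValueError("total_pages and pages_per_run must be positive")
    return [
        (start, min(start + pages_per_run - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_run)
    ]


def indexing_commands(
    pdf_path: str,
    total_pages: int,
    pages_per_run: int = 20,
    chunk_size: int = 150,
    overlap: int = 25,
    batch_size: int = 2,
) -> list[str]:
    """Build one index_pdfs.py command line per page range."""
    commands = []
    for start, end in page_ranges(total_pages, pages_per_run):
        args = [
            "python", "index_pdfs.py",
            "--path", pdf_path,
            "--chunk-size", str(chunk_size),
            "--overlap", str(overlap),
            "--batch-size", str(batch_size),
            "--start-page", str(start),
            "--end-page", str(end),
        ]
        # shlex.join applies POSIX-style quoting where a value needs it.
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for command in indexing_commands("data/your_document.pdf", total_pages=45):
        print(command)
```

Running the ranges one at a time also gives a natural restart point if a run is interrupted.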
If Weaviate doesn't start properly:
# Check Weaviate container logs
docker logs weaviate
# Restart Weaviate
docker-compose restart weaviate
.
├── arag/ # Core RAG system modules
│ ├── agents/ # Agent implementations
│ ├── core/ # Core functionality
│ │ ├── ai_client.py # AI models integration
│ │ ├── memory.py # Memory management
│ │ ├── orchestrator.py # Query processing orchestration
│ │ └── vector_db.py # Vector database client
│ └── utils/ # Utilities
├── data/ # Store your PDF documents here
├── logs/ # Log files
├── output/ # Query output and memory files
├── docker-compose.yaml # Docker configuration
├── Dockerfile # For containerization
├── index_pdfs.py # PDF indexing script
├── run.py # Main query script
└── run_indexing.bat # Windows batch script for indexing
- chunk-size: Size of text chunks (smaller = more memory-efficient)
- overlap: Overlap between chunks to maintain context
- batch-size: Number of chunks to process at once (smaller = more memory-efficient)
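To illustrate how these three parameters interact, here is a minimal sketch of character-based chunking with overlap, followed by batching. It is not the project's actual chunker, which may split on sentence or page boundaries; `chunk_text` and `in_batches` are hypothetical names, and the defaults mirror the values used in the commands above.

```python
from collections.abc import Iterable, Iterator


def chunk_text(text: str, chunk_size: int = 150, overlap: int = 25) -> Iterator[str]:
    """Yield chunks of up to chunk_size characters, each sharing
    `overlap` characters with the previous chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        yield text[start:start + chunk_size]
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text


def in_batches(items: Iterable[str], batch_size: int = 2) -> Iterator[list[str]]:
    """Group items into lists of at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: list[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    sample = "x" * 400
    for batch in in_batches(chunk_text(sample), batch_size=2):
        print([len(chunk) for chunk in batch])
```

Because both functions are generators, only one batch of chunks is held in memory at a time, which is why smaller chunk-size and batch-size values lower peak memory use in a design like this.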
Process specific pages to:
- Test indexing before committing to the full document
- Resume interrupted indexing
- Update specific sections of documents
# Process pages 5-15 only
python index_pdfs.py --path "data/document.pdf" --start-page 5 --end-page 15
This project is licensed under the MIT License - see the LICENSE file for details.