DocuRAG is an interactive Retrieval-Augmented Generation (RAG) application that lets you process documents or URLs and ask natural-language questions about their content. It demonstrates a typical RAG workflow: ingest → split → embed → index → retrieve → generate.
👉 Live Demo: Click here to try the app
- Upload documents (.pdf, .txt, .docx) or provide a URL to ingest.
- Split documents into overlapping chunks for retrieval.
- Create embeddings and store vectors in a local FAISS index.
- Query the index and generate answers with an LLM (Gemini or another provider).
- Simple Streamlit UI for demos; the core pipeline is reusable.
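The split, embed, and retrieve steps listed above can be sketched in plain Python. This is a minimal illustration, not the code in core/: the function names are hypothetical, `toy_embed` is a bag-of-words stand-in for a real embedding model, and the brute-force ranking stands in for a FAISS lookup.

```python
import math
from collections import Counter


def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into character windows that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def toy_embed(text):
    """Assumption: a word-count vector standing in for a real embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (brute force, no FAISS)."""
    query_vec = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, toy_embed(c)), reverse=True)
    return ranked[:k]
```

The overlap means text near a chunk boundary appears in two neighbouring chunks, so a sentence that straddles a boundary is still retrievable in one piece. In a real pipeline the retrieved chunks would be passed to the LLM as context for the answer.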
- DocuRAG.py: Streamlit user interface (thin wrapper).
- core/: backend modules (loader, embedding wrapper, retriever, pipeline, config, errors).
- tests/: unit tests covering core behaviors.
- faiss_store_gemini/: local FAISS index directory (ignored by git).
- Python 3.9+
- Streamlit for demo UI
- LangChain and LangChain community loaders
- FAISS (vector store)
- Google Gemini (optional) or any compatible embedding/LLM provider
- Clone the repository
    git clone https://github.com/your-username/DocuRAG.git
    cd DocuRAG
- Create and activate a virtual environment
    python -m venv venv
    source venv/bin/activate # macOS/Linux
    venv\Scripts\activate # Windows
- Install dependencies
    pip install -r requirements.txt

Create a .env file in the repository root with at least your API key(s):
GEMINI_API_KEY=your_gemini_api_key
# or use GOOGLE_API_KEY if applicable
Other runtime configuration options are in core/config.py:
- INDEX_PATH: path for the FAISS index (default: faiss_store_gemini).
- MAX_UPLOAD_BYTES: maximum allowed upload size for files.
- ALLOW_DANGEROUS_DESERIALIZATION: when True, allows deserializing certain index files that require unpickling. This can execute arbitrary code if the index file is from an untrusted source; set to False in production or when loading indexes from unknown origins.
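As a rough illustration of how these options fit together, here is a sketch of a settings loader. It is not the contents of core/config.py: the `Settings` class, the `load_settings` function, reading the options from environment variables, and the 10 MiB upload default are all assumptions made for the example. It also assumes the .env file has already been loaded into the environment.

```python
import os
from dataclasses import dataclass


def _as_bool(value, default=False):
    """Interpret common truthy strings; anything else counts as False."""
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    api_key: str
    index_path: str
    max_upload_bytes: int
    allow_dangerous_deserialization: bool


def load_settings(env=None):
    """Build Settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    # Prefer GEMINI_API_KEY, fall back to GOOGLE_API_KEY.
    api_key = env.get("GEMINI_API_KEY") or env.get("GOOGLE_API_KEY")
    if not api_key:
        raise RuntimeError("Set GEMINI_API_KEY or GOOGLE_API_KEY in .env")
    return Settings(
        api_key=api_key,
        index_path=env.get("INDEX_PATH", "faiss_store_gemini"),
        # Assumed default of 10 MiB; the project's real default may differ.
        max_upload_bytes=int(env.get("MAX_UPLOAD_BYTES", 10 * 1024 * 1024)),
        # Off unless explicitly enabled, since unpickling can run arbitrary code.
        allow_dangerous_deserialization=_as_bool(
            env.get("ALLOW_DANGEROUS_DESERIALIZATION")
        ),
    )
```

Defaulting the deserialization flag to off keeps the risky path opt-in, which matches the advice above for production and untrusted indexes.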
If you want to rebuild the index instead of loading an existing one,
delete the faiss_store_gemini/ directory and re-process your documents.
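If you prefer to script the reset, a small helper along these lines would do the same thing. The `reset_index` name is hypothetical and not part of the project; the default path assumes INDEX_PATH has not been changed.

```python
import shutil
from pathlib import Path


def reset_index(index_path="faiss_store_gemini"):
    """Delete the on-disk index directory; return True if something was removed."""
    path = Path(index_path)
    if not path.is_dir():
        return False  # nothing to delete, so the next run builds a fresh index
    shutil.rmtree(path)
    return True
```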
Run the Streamlit demo:
streamlit run DocuRAG.py
Follow the UI to paste a URL or upload a file, process the data to build an index, and then ask questions about the processed content.
Run unit tests with:
python -m unittest discover -s tests -p "test_*.py" -v
The tests use lightweight fakes so they run without heavy third-party LLM or FAISS packages.
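To give a feel for the approach, here is a sketch of a test built on a fake. It is illustrative only: `FakeEmbedder` and `build_index` are hypothetical names, not the project's actual test doubles or pipeline functions.

```python
import unittest


class FakeEmbedder:
    """Hypothetical test double: deterministic vectors, no network or API key."""

    def embed(self, text):
        # Two cheap, deterministic features: length and vowel count.
        return [float(len(text)), float(sum(text.count(v) for v in "aeiou"))]


def build_index(chunks, embedder):
    """Hypothetical stand-in for the step that embeds and stores chunks."""
    return [(chunk, embedder.embed(chunk)) for chunk in chunks]


class BuildIndexTest(unittest.TestCase):
    def test_every_chunk_gets_a_vector(self):
        index = build_index(["alpha", "be"], FakeEmbedder())
        self.assertEqual([chunk for chunk, _ in index], ["alpha", "be"])
        self.assertEqual(index[0][1], [5.0, 2.0])
```

Because the fake returns the same vector for the same input every time, the assertions stay stable and the test needs neither an API key nor a FAISS install.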
- Never commit your .env or API keys to source control. .gitignore is configured to exclude .env, virtual environments, and the FAISS index folder.
- Uploaded documents are stored temporarily during processing and removed as soon as parsing completes. Treat any deserialization options with care: avoid enabling dangerous deserialization for untrusted index files.
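The temporary-file handling described above could look roughly like the following. This is a sketch under assumptions rather than the project's loader: `parse_upload` and its `parse` callback are hypothetical, and the 10 MiB limit is a placeholder for whatever MAX_UPLOAD_BYTES is set to.

```python
import os
import tempfile

# Assumed limit for the example; the real value lives in core/config.py.
MAX_UPLOAD_BYTES = 10 * 1024 * 1024


def parse_upload(data, suffix, parse, max_bytes=MAX_UPLOAD_BYTES):
    """Write uploaded bytes to a temp file, parse it, and always delete the file."""
    if len(data) > max_bytes:
        raise ValueError(f"upload is {len(data)} bytes; limit is {max_bytes}")
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        return parse(path)  # `parse` is any callable that takes a file path
    finally:
        os.remove(path)  # runs on success and on parser errors alike
```

Checking the size before anything touches the disk and deleting in a `finally` block means an oversized or malformed upload does not leave a file behind.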
This project is suitable as a demo or prototype. To move to production, consider:
- Extracting a service layer (FastAPI) with authentication and rate limits.
- Using a managed vector database (e.g., Milvus, Chroma Cloud) for scale.
- Adding monitoring and metrics for retrieval quality and index health.
- Encrypting on-disk data if storing sensitive documents.