A Retrieval-Augmented Generation (RAG) system featuring binary quantized embeddings, Milvus Lite vector search with reranking, an LLM router, and web search fallback via Firecrawl.
- Binary Quantized Embeddings: Efficient storage and retrieval using binary vectors (see the sketch after this list)
- Milvus Lite Integration: Lightweight vector database for local development
- Intelligent Reranking: Enhanced retrieval quality using reciprocal rank fusion
- Multi-Agent Workflow: Router agent for quality evaluation and web search routing
- Web Search Fallback: Firecrawl integration for real-time information retrieval
- Modular Architecture: Clean separation of concerns for easy extensibility
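The binary quantization behind the embeddings feature can be sketched in a few lines of NumPy: threshold each embedding dimension at zero, pack the bits, and rank candidates by Hamming distance. This is an illustrative sketch of the technique, not the actual EmbedData/Retriever internals:

import numpy as np

def binarize(embeddings: np.ndarray) -> np.ndarray:
    # Keep only the sign of each dimension (1 if >= 0, else 0), then pack
    # 1024 float32 dims into 128 uint8 bytes per vector (~32x smaller).
    bits = (embeddings >= 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

def hamming_rank(query_bits: np.ndarray, corpus_bits: np.ndarray, top_k: int = 5):
    # XOR the packed bytes and count differing bits; smaller distance = closer.
    diff = np.bitwise_xor(corpus_bits, query_bits)
    distances = np.unpackbits(diff, axis=-1).sum(axis=-1)
    return np.argsort(distances)[:top_k]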
User Query
  ↓
1. Document Retrieval (Vector Search + Reranking)
  ↓
2. RAG Response Generation
  ↓
3. Router Agent (Quality Evaluation)
  ↓
4a. SATISFACTORY → Response Synthesis
4b. UNSATISFACTORY → Web Search → Response Synthesis
  ↓
5. Final Answer
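In code, the routing logic amounts to the following (an illustrative sketch of the control flow; the real orchestration lives in src/workflows/agent_workflow.py, and the evaluate/web_search/synthesize helpers here are hypothetical placeholders):

async def run_workflow(query, retriever, rag, evaluate, web_search, synthesize):
    # 1-2) Retrieve documents and draft a RAG answer
    documents = retriever.search(query)
    draft = rag.query(query)

    # 3) Router agent grades the draft answer
    verdict = evaluate(query, draft)  # "SATISFACTORY" or "UNSATISFACTORY"

    # 4b) Fall back to web search when the draft is not good enough
    if verdict == "UNSATISFACTORY":
        web_results = web_search(query)
        draft = synthesize(query, documents, web_results)

    # 5) Final answer
    return {"answer": draft, "route": verdict}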
paralegal-agent/
├── pyproject.toml           # Project metadata and dependencies
├── config/
│   └── settings.py          # Configuration via environment variables
├── src/
│   ├── embeddings/          # Embedding generation with binary quantization
│   ├── indexing/            # Milvus vector database integration
│   ├── retrieval/           # Vector search (no reranking in demo)
│   ├── generation/          # RAG response generation
│   └── workflows/           # Multi-agent workflow orchestration
├── examples/
│   └── test.py              # Sample usage and testing
├── data/                    # Local data (e.g., PDFs, DB files)
├── cache/                   # Model/cache directories
├── app.py                   # Streamlit app
└── README.md
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync

Create a .env file in the project root, or export the required variables in your shell:
cp .env.example .env
# Edit .env with your API keys

Note: Both keys are read by config/settings.py. Web search is optional at runtime, but the settings loader expects both keys to be present.
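A minimal .env looks like this (placeholder values; the variable names are the ones config/settings.py reads):

OPENAI_API_KEY=your_openai_api_key
FIRECRAWL_API_KEY=your_firecrawl_api_key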
streamlit run app.py
# or, with uv
uv run streamlit run app.py

Upload a PDF in the sidebar, add API keys, and start chatting.
import asyncio
import os
from pathlib import Path
from llama_index.core import SimpleDirectoryReader
from src.embeddings.embed_data import EmbedData
from src.indexing.milvus_vdb import MilvusVDB
from src.retrieval.retriever_rerank import Retriever
from src.generation.rag import RAG
from src.workflows.agent_workflow import EnhancedRAGWorkflow
async def quick_start():
os.environ.setdefault("OPENAI_API_KEY", "<your_key>")
os.environ.setdefault("FIRECRAWL_API_KEY", "<your_key>")
# 1) Load and split a PDF into text chunks
pdf_path = "./data/your_document.pdf"
docs = SimpleDirectoryReader(input_files=[pdf_path]).load_data()
text_chunks = [d.text for d in docs]
# 2) Create embeddings (with binary quantization)
embedder = EmbedData()
embedder.embed(text_chunks)
# 3) Setup Milvus Lite collection
vdb = MilvusVDB()
vdb.initialize_client()
vdb.create_collection()
vdb.ingest_data(embedder)
# 4) Retrieval and RAG
retriever = Retriever(vdb, embedder)
rag = RAG(retriever)
# 5) Enhanced workflow
workflow = EnhancedRAGWorkflow(retriever=retriever, rag_system=rag)
result = await workflow.run_workflow("Your question here")
print(result["answer"])
asyncio.run(quick_start())Managed in config/settings.py (via environment variables):
# Model Configuration
embedding_model = "BAAI/bge-large-en-v1.5"
llm_model = "gpt-3.5-turbo"
vector_dim = 1024
# Retrieval Configuration
top_k = 5
batch_size = 512
# Database Configuration
milvus_db_path = "./data/milvus_binary.db"
collection_name = "legal_documents"from src.embeddings.embed_data import EmbedData
embedder = EmbedData()
embedder.embed(text_chunks)

from src.indexing.milvus_vdb import MilvusVDB
vdb = MilvusVDB()
vdb.initialize_client()
vdb.create_collection()
vdb.ingest_data(embedder)

from src.retrieval.retriever_rerank import Retriever
retriever = Retriever(vector_db=vdb, embed_data=embedder, top_k=5)
results = retriever.search("Your query")from src.generation.rag import RAG
rag = RAG(retriever=retriever)
answer = rag.query("Your question")from src.workflows.agent_workflow import EnhancedRAGWorkflow
workflow = EnhancedRAGWorkflow(retriever=retriever, rag_system=rag)
result = await workflow.run_workflow("Complex question")- Retrieval: Binary vector search with Hamming distance
- RAG: Context construction + OpenAI completion
- Router: LLM-based quality evaluation (SATISFACTORY/UNSATISFACTORY); see the sketch after this list
- Web Search: Firecrawl web search + content extraction
- Synthesis: Combine document and web info; refine final answer
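The router step boils down to a single LLM call that returns one of the two labels. A minimal sketch using the OpenAI client (the prompt wording and function name are illustrative, not the project's actual implementation):

from openai import OpenAI

def evaluate_answer(question: str, answer: str) -> str:
    # Ask the LLM to grade the draft answer and reply with a one-word verdict.
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                "Given the question and draft answer below, reply with exactly one word: "
                "SATISFACTORY if the answer is complete and grounded, or UNSATISFACTORY "
                "if a web search is needed.\n\n"
                f"Question: {question}\n\nAnswer: {answer}"
            ),
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper()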
- Legal Research Assistant: Index PDFs and ask targeted questions
- Document Analysis: Identify risks or obligations in contracts
Run the included example:
python examples/test.py
# or
uv run examples/test.py

- Missing API keys: Ensure OPENAI_API_KEY and FIRECRAWL_API_KEY are set (env or .env).
- Model download issues: Check/clear cache/hf_cache and ensure network access to Hugging Face.
- Milvus Lite file locks: Stop the app/process that holds the DB, then remove data/*.db files.
Enable detailed logging:
from loguru import logger
logger.add("debug.log", level="DEBUG")- Fork the repository
- Create a feature branch:
git checkout -b feature-name - Make your changes and add tests
- Submit a pull request
MIT License
- LlamaIndex, Milvus, OpenAI, Firecrawl, HuggingFace
Note: This is an MVP implementation focused on core functionality. For production use, consider additional features like user authentication, rate limiting, monitoring, and robust error handling.