ModularRAGChat is a web-based application that allows users to collect, process, and query data from various sources using a local LLM (Qwen2.5:7B via Ollama) enhanced with RAG. It supports ingestion from web searches (via DuckDuckGo), Reddit/Subreddit threads, YouTube transcripts, and files (TXT/PDF). Once data is collected into "collections," users can chat with the bot to retrieve and summarize information, with optional Ollama augmentation for content refinement.
Pros:
- Modular Design: The codebase is structured with separate utilities (e.g., web_utils.py, youtube_utils.py) for each data source, making it easy to extend or maintain. This aligns with LangChain's chain-based approach for retrieval and generation.
- Efficient Ingestion and Retrieval: Uses FAISS for vector storage, spaCy for NLP (e.g., entity extraction, chunking), and ensemble retrieval (dense + BM25) for accurate results. Supports domain-specific handling, like lyrics extraction.
- User-Friendly UI: Built with Gradio, it includes tabs for collections, chat, and admin views, with background threading for non-blocking operations.
- Local and Privacy-Focused: Runs entirely locally, avoiding cloud dependencies for core functionality, which is ideal for prototyping.
Cons:
- Scalability Limitations: Current setup uses local SQLite and FAISS, which aren't optimized for large-scale data or concurrent users. Memory constraints could arise with massive collections.
- Error Handling Gaps: While debug statements are in place, production-level robustness (e.g., retry mechanisms for web fetches or Ollama calls) is minimal, leading to occasional failures on unreliable sources like YouTube transcripts.
- Performance Overhead: Augmentation with Ollama adds processing time, and the Selenium-based YouTube scraper can be brittle due to UI changes on YouTube.
- Limited Domain Adaptation: NER via spaCy is general-purpose; it doesn't handle specialized jargon (e.g., gaming terms in PoE builds) as effectively as a fine-tuned model would.
The project is at a functional prototype stage, with core ingestion/query flows working reliably for small-to-medium datasets. It's served as a great learning tool for integrating LLMs with RAG, but it's not yet production-ready.
What I'd Do Differently for Scalability or Launching on AWS
If redesigning for scalability and AWS deployment, I'd prioritize cloud-native architecture to handle larger loads, better reliability, and easier maintenance.
- Database and Vector Store Migration: Replace local SQLite with Amazon RDS (PostgreSQL) for structured data and Pinecone or Amazon OpenSearch for vector embeddings (instead of FAISS). This enables horizontal scaling and managed backups. For RAG, I'd use LangChain's AWS integrations to sync embeddings or integrate embeddings through Bedrock for hosted models.
- Serverless Compute: Host the backend on AWS Lambda for event-driven processing (e.g., collection tasks triggered via API Gateway). The Gradio UI could run on EC2 or ECS, but for cost-efficiency, I could containerize with Docker and deploy to Fargate. This reduces idle costs and auto-scales with traffic.
- Storage and Caching: Use S3 for raw contents and consolidated files, with CloudFront CDN for fast access. Implement Redis (ElastiCache) for caching frequent queries, reducing LLM calls.
- Augmentation and LLM Optimization: Offload Ollama to SageMaker endpoints for managed inference, or switch to Bedrock for hosted models (e.g., Claude or Llama). For augmentation, batch process chunks with Step Functions to parallelize NLP/Ollama steps, cutting latency.
- Monitoring and Security: Add CloudWatch for logs/metrics, X-Ray for tracing, and IAM roles for secure access. Use Cognito for user authentication if expanding to multi-user.
- CI/CD and Testing: Set up CodePipeline for automated deployments from GitHub, with unit tests for utils (e.g., mocking web fetches) and integration tests for RAG flows.
This would make the app more robust, cost-effective, and ready for real-world use, shifting from local prototyping to a scalable service.
Future Feature Improvement Ideas
- Domain-Specific NLP Fine-Tuning: Train a custom spaCy model on labeled data for targeted domains (e.g., lyrics, gaming, finance). This would improve NER for entities like song titles or build terms in PoE, leading to better chunking, query rephrasing, and augmentation. For example, recognize "Tout Le Monde" as a song title and prioritize lyric extraction. I'd use Prodigy for annotation and integrate it into the ingestion pipeline.
- Content-Type Specific Agents: Create pluggable "agents" for different media: e.g., a lyrics agent that formats verses/choruses, a video agent with ASR (e.g., Whisper integration for non-transcripted YouTube), or a forum agent for threaded comments. This could use LangGraph for multi-step workflows, allowing dynamic routing based on source (web vs. Reddit).
- Enhanced UX Transparency: Convert debug logs into progress bars/ETAs in Gradio (e.g., estimate time based on URL count). Add a dashboard for collection stats (e.g., chunk count, augmentation status) and notifications for errors. For long-running tasks, use WebSockets for real-time updates.
- Integration Expansions: Add Google APIs for Calendar/Gmail/Tasks, using OAuth and LangChain tools to create an AI assistant (e.g., "Schedule a meeting based on email"). Modular design supports this—new utils.py files for each service, with a central orchestrator.
- Advanced RAG Features: Implement hybrid search with reranking (e.g., Cohere Rerank), multi-modal support (e.g., images from web pages via CLIP), and feedback loops for user-rated responses to fine-tune retrieval.
- Optimization and Ethics: Add rate limiting for web scrapes, ethical checks (e.g., robots.txt compliance), and compression for storage. For scalability, explore vector quantization in FAISS or sharding collections.