A RAG-based chatbot for querying and summarizing health news articles.
LLM Orchestration: LangChain | Vector Store: ChromaDB | UI: Streamlit
Healthline Assistant uses Retrieval-Augmented Generation (RAG) with vector search to retrieve relevant health news articles and generate grounded, citation-ready responses using modern LLMs.
A fast, grounded, Healthline‑only RAG-based chatbot that builds local embeddings from provided URLs and answers strictly from those sources.
Builds a local vectorstore from user‑selected Healthline articles, then answers questions strictly using content retrieved from those sources.
Eliminates manual copy‑paste and unreliable, non‑grounded answers by constraining the model to only the supplied Healthline context.
Inspired by a conversation with a doctor neighbour who found it time‑consuming to manually copy links from Pocket into ChatGPT for summarization; this RAG solution automates ingestion, retrieval, and grounded answering exclusively from Healthline articles.
Download the demo video to watch it; the file is only 6 MB.
Healthcare_Assistant/
├─ core/ # Backend (ingestion, chunking, embeddings, vectorstore, retrieval, LLM, QA)
│ ├─ __init__.py
│ ├─ config_loader.py # Loads config.yaml + .env and resolves paths (e.g., persist directory)
│ ├─ loader.py # Robust URL loader for Healthline content
│ ├─ chunker.py # Recursive chunk splitter with overlap
│ ├─ embeddings.py # Embedding factory (HF/SentenceTransformer + trust_remote_code support)
│ ├─ vector_store.py # Chroma lifecycle (reset persist dir; create store)
│ ├─ indexer.py # Full rebuild on new URLs; writes collection fingerprint + sources manifest
│ ├─ retrieval.py # Vectorstore rehydration + general/per-source retrievers
│ ├─ llm.py # Chat model factory (Groq or configured LLM)
│ └─ qa.py # Strictly grounded QA + per-source summarization (no external citations)
│
├─ frontend/
│ └─ ui_interface.py # Streamlit UI (dark mode, 10 URL slots, validation, grounded answers)
│
├─ config/
│ ├─ config.yaml # App configuration (models, chunking, retrieval, paths)
│ └─ .env # Secrets and runtime env (e.g., GROQ_API_KEY, EMBEDDING_MODEL)
│
├─ main.py # CLI runner: index URLs and ask questions (or summarize per article)
├─ requirements.txt # Python dependencies
└─ README.md # Project documentation
⚠️ Note: You must create a .env file inside the config/ folder and provide the following variables:
- GROQ_API_KEY
- CHROMA_DIR
- GROQ_MODEL
- EMBEDDING_MODEL
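A quick pre-flight check for these variables could look like the sketch below. It is not part of the project: the function name `missing_env_vars` is hypothetical, and it treats all four names as expected even though the setup comments describe CHROMA_DIR as an optional override.

```python
import os

# Names taken from the note above. Whether each one is strictly required is an
# assumption; the setup comments describe CHROMA_DIR as an optional override.
EXPECTED_VARS = ("GROQ_API_KEY", "CHROMA_DIR", "GROQ_MODEL", "EMBEDDING_MODEL")


def missing_env_vars(env=None):
    """Return the expected variable names that are unset or blank in env."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_VARS if not str(env.get(name, "")).strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing in environment:", ", ".join(missing))
```

Passing a plain dict instead of `os.environ` makes the check easy to run against values parsed from config/.env before the app starts.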
# 1) Python and virtual environment
python -V # recommend 3.10+
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate
# 2) Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
# 3) Configure environment (copy and edit as needed)
# - config/.env must define at least:
# GROQ_API_KEY=<your_key>
# EMBEDDING_MODEL="Alibaba-NLP/gte-base-en-v1.5" # or another supported model
# # Optional override: CHROMA_DIR=vector_resources/vectorstore
# 4) Verify config
# - config/config.yaml is present
# - CHROMA persist directory is resolved under config/ by default
Notes:
- The embedding loader supports models that require custom code (trust_remote_code=True) and normalizes embeddings for cosine search when appropriate.
- The vectorstore path is resolved relative to the config/ folder by default and can be overridden via CHROMA_DIR in .env (relative paths recommended).
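The path rule in the second note can be sketched as follows. This is an illustration under assumptions, not the code in config_loader.py: the function name `resolve_persist_dir` and its parameters are hypothetical, and it assumes an absolute path is used as given.

```python
from pathlib import Path


def resolve_persist_dir(config_dir, configured_path, override=None):
    """Pick the vectorstore directory.

    config_dir      -- the config/ folder (hypothetical parameter)
    configured_path -- the path from config.yaml (hypothetical parameter)
    override        -- the CHROMA_DIR value from .env, if any
    """
    # A non-blank CHROMA_DIR takes precedence over the configured path.
    chosen = override.strip() if override and override.strip() else configured_path
    path = Path(chosen)
    # Assumption: absolute paths are kept as-is; relative ones hang off config/.
    if path.is_absolute():
        return path
    return Path(config_dir) / path
```

With this rule, setting `CHROMA_DIR=vector_resources/vectorstore` places the store at `config/vector_resources/vectorstore`, which is why relative paths keep the project portable across machines.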
# From the project root
python main.py
Follow prompts:
- Paste 1–10 Healthline URLs
- Wait for indexing: old embeddings are cleared; new ones are built
- Enter a query (answers are strictly from the indexed sources)
- To summarize per article: phrase query like "summarize them separately"
# From the project root
streamlit run frontend/ui_interface.py
In the app:
- Paste up to 10 Healthline URLs (fixed rows)
- Click "Validate & Submit URLs" to rebuild embeddings (resets vectorstore)
- Enter a query and "Submit query" for a grounded answer + Healthline source list
✅ Behavior guarantees:
- "Validate & Submit URLs" always clears the existing vectorstore and rebuilds embeddings from scratch.
- Answers are strictly grounded; if nothing relevant is retrieved, the app returns the exact fallback message.
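The grounding guarantee amounts to a guard in front of the LLM call. The sketch below shows the idea only and is not the code in qa.py: `retrieve` and `generate` are hypothetical callables standing in for the retriever and the chat model, the dict-shaped documents are an assumption, and the fallback text is passed in by the caller because its exact wording is not reproduced here.

```python
def answer_from_sources(question, retrieve, generate, fallback_message):
    """Answer only from retrieved chunks; otherwise return the fallback.

    retrieve(question) -> list of {"text": ..., "source": ...} dicts (assumed shape)
    generate(question, docs) -> answer string built from docs only
    """
    docs = retrieve(question)
    if not docs:
        # Nothing relevant was retrieved, so the model is never called.
        return fallback_message, []
    answer = generate(question, docs)
    # Keep each source URL once, in the order it was first retrieved.
    sources = list(dict.fromkeys(doc["source"] for doc in docs))
    return answer, sources
```

Skipping the model call when retrieval comes back empty is what keeps the answer tied to the indexed articles: the model never sees a question without supporting context.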
Accepted URL formats:
- healthline.com
- www.healthline.com
- https://www.healthline.com

Canonicalization:
- Forces https scheme and www.healthline.com netloc
- Lowercases the path
- Removes query/fragment
- Collapses duplicate slashes
- Trims trailing slash (except root)
Detects duplicates across formats using canonical form (e.g., healthline.com/..., www.healthline.com/..., and https://www.healthline.com/... all resolve to one).
- 1–10 URLs per session; empty rows are ignored.
- Only validated, canonical Healthline URLs proceed to loading and embedding.
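The canonicalization and validation rules above can be sketched with the standard library. This is a minimal illustration, not the project's validator: the names `canonicalize` and `validate_urls` are hypothetical, and accepting `http://` input before forcing https is an assumption.

```python
import re
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"healthline.com", "www.healthline.com"}
MAX_URLS = 10


def canonicalize(raw):
    """Return the canonical Healthline URL, or None if raw is not a Healthline link."""
    text = raw.strip()
    if not text:
        return None
    # Bare input such as "healthline.com/health/x" has no scheme; add one so
    # urlsplit places the host in netloc instead of path.
    if not re.match(r"[a-z][a-z0-9+.\-]*://", text, re.IGNORECASE):
        text = "https://" + text
    try:
        parts = urlsplit(text)
    except ValueError:
        return None
    if parts.scheme not in ("http", "https"):
        return None
    if (parts.hostname or "") not in ALLOWED_HOSTS:
        return None
    # Lowercase the path, collapse duplicate slashes, trim the trailing slash.
    path = re.sub(r"/{2,}", "/", parts.path.lower())
    if path != "/":
        path = path.rstrip("/")
    # Query and fragment are dropped by rebuilding from scheme, host, and path.
    return "https://www.healthline.com" + (path or "/")


def validate_urls(rows):
    """Return (accepted canonical URLs without duplicates, rejected raw rows)."""
    accepted, rejected = {}, []
    for row in rows:
        if not row.strip():
            continue  # empty rows are ignored
        url = canonicalize(row)
        if url is None:
            rejected.append(row)
        else:
            accepted.setdefault(url, None)  # dict keys deduplicate, keeping order
    if not 1 <= len(accepted) <= MAX_URLS:
        raise ValueError(f"expected 1-{MAX_URLS} valid Healthline URLs, got {len(accepted)}")
    return list(accepted), rejected
```

Because duplicates are compared after canonicalization, three differently formatted links to the same article count as one URL toward the limit of 10.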
- ✅ Python 3.10+
- ✅ Streamlit (frontend UI)
- ✅ LangChain (chains, prompts, retrievers)
- ✅ ChromaDB (local vectorstore persistence)
- ✅ Sentence-Transformers / Hugging Face (embeddings, trust_remote_code support)
- ✅ Groq (LLM API integration); can also run locally via Ollama
- ✅ Unstructured URL loader (robust web article parsing)
- ✅ python-dotenv, PyYAML (config/env management)
- 📜 Detailed module‑wise logs: Structured logging for ingestion, chunking, embedding, retrieval, and answering to simplify audits and error tracing.
- 🔗 Pocket integration: One‑click import of saved Healthline links from Pocket.
- ☁️ Cloud deployment: Dockerize and deploy on a managed platform.
- 🧪 Model experiments: Test other embeddings and LLMs for groundedness and evaluate multi‑query retrieval for complex questions.
- Healthline articles for high‑quality, clinician‑reviewed content used as the knowledge base.
- Open‑source maintainers across the LangChain, ChromaDB, Hugging Face, Streamlit, and Sentence‑Transformers ecosystems.
“All rights to the content in the provided URLs belong solely to Healthline Media LLC.”
- Re‑indexing resets prior embeddings; keep distinct sessions per topic for focused retrieval.
- Use precise, article‑aligned queries; for multi‑article tasks, the system can summarize each article separately.
- If a fact isn’t in the supplied Healthline sources, the assistant will return the exact fallback rather than hallucinate.
Expect a RAG-powered, assumption-averse approach to answers.