# Local Nepali Document Chat

Nepali-first document reading: upload PDFs and scanned images, run multi-engine OCR tuned for Devanagari / Nepali, then optionally clean the text with a local LLM (e.g. Gemma via Ollama) or Anthropic Claude.



## Why this project

- Real-world Nepali documents are mostly noisy Devanagari scans: visually similar glyphs, matras, and skewed pages break generic OCR.
- This stack combines PaddleOCR (PP-OCRv5, Nepali `ne`), EasyOCR (`ne` + `en`), and Tesseract where useful, then merges the results intelligently.
- Privacy-friendly: run OCR and post-correction on your machine; cloud APIs are optional.

## Features

| Area | What you get |
| --- | --- |
| OCR | Ensemble: PaddleOCR 3.x (PP-OCRv5, `lang=ne`) + EasyOCR + Tesseract; digital PDF text when pages are not scanned |
| Languages | Nepali (`ne`) and mixed Nepali/English |
| LLM cleanup | Configurable: Ollama (local), Anthropic, or disabled |
| API | FastAPI: upload, job status, full OCR/NLP payload, plain text |
| UI | React + Vite + Tailwind; proxies `/api` to the backend in dev |
| Jobs | Celery + Redis for async processing; inline processing mode for simple local setups |

## Architecture (high level)

```text
Upload → FastAPI → SQLite (metadata) → Celery worker OR inline thread
                    ↓
            PDF / image → preprocess → OCR ensemble → post-process
                    ↓
            Optional: LLM Devanagari correction (Ollama / Claude)
```
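The README says the engine outputs are merged "intelligently"; the actual merge logic lives in the repo. Purely as an illustration of the idea (not the repo's implementation), a confidence-based merge over hypothetical per-engine results might look like:

```python
from dataclasses import dataclass

@dataclass
class OcrResult:
    """One engine's output for a page. Hypothetical shape, not the repo's actual model."""
    engine: str
    text: str
    confidence: float  # mean per-line confidence in [0, 1]

def merge_results(results: list[OcrResult]) -> str:
    """Pick the highest-confidence engine's text.

    This is the simplest possible strategy, shown only to make the
    "ensemble → merge" step in the diagram concrete; a real merge would
    reconcile results line by line.
    """
    if not results:
        return ""
    best = max(results, key=lambda r: r.confidence)
    return best.text
```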

## Requirements

- Python 3.11+ (3.11 recommended; match the Docker image)
- Node.js 20+ (for the frontend)
- System packages (local runs):
  - Poppler (`pdfinfo`, etc.): required for pdf2image / PDF rasterization
    - macOS: `brew install poppler`
    - Debian/Ubuntu: `apt install poppler-utils`
  - Tesseract with Nepali data (optional but used by the ensemble)
    - macOS: `brew install tesseract tesseract-lang` (ensure the `nep` / Devanagari scripts are available)

Optional:

- Redis, if you use Celery workers (otherwise rely on `PROCESS_INLINE=true`).
- Ollama, for local correction; e.g. `ollama pull gemma3:12b` (or any capable model you prefer).

## Quick start (local)

### 1. Clone and environment

```shell
git clone git@github.com:rayraycodes/LocalNepaliDocumentChat.git
cd LocalNepaliDocumentChat
cp .env.example .env
# Edit .env: OLLAMA_MODEL, CORRECTION_LLM_BACKEND, optional ANTHROPIC_API_KEY
```

### 2. Backend

```shell
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -U pip
pip install -e ".[dev]"
```

Start the API (from repo root):

```shell
uvicorn backend.main:app --host 0.0.0.0 --port 8000 --reload
```

- Set `PROCESS_INLINE=true` in `.env` if you are not running Redis + Celery (processing runs inside the API after upload).
- With `PROCESS_INLINE=false`, start Redis and a worker:

```shell
celery -A backend.celery_app worker --loglevel=info
```

### 3. Frontend

```shell
cd frontend
npm install
npm run dev
```

Open http://localhost:5173. The Vite dev server proxies `/api` and `/health` to port 8000.

### 4. Ollama (optional, recommended for local correction)

```shell
ollama serve
ollama pull gemma3:12b   # or another model; set OLLAMA_MODEL in .env to match
```
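Under the hood, local correction means sending the OCR text to Ollama's HTTP API. As a rough sketch of that round trip (the prompt wording and function names here are illustrative, not the repo's actual code), using Ollama's standard `/api/generate` endpoint:

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://127.0.0.1:11434"  # matches the default in the config table

def build_correction_prompt(ocr_text: str) -> str:
    """Assemble a Devanagari-correction prompt. Wording is illustrative only."""
    return (
        "Fix OCR errors in the following Nepali (Devanagari) text. "
        "Return only the corrected text.\n\n" + ocr_text
    )

def correct_with_ollama(ocr_text: str, model: str = "gemma3:12b") -> str:
    """Send one non-streaming generate request to Ollama and return its response text."""
    payload = json.dumps({
        "model": model,
        "prompt": build_correction_prompt(ocr_text),
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{OLLAMA_BASE_URL}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Requires `ollama serve` to be running; set `model` to whatever tag you pulled.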

## Docker Compose

Redis, API, worker, and frontend services are defined in `docker-compose.yml`.

```shell
cp .env.example .env
docker compose up --build
```

## Configuration reference

| Variable | Purpose |
| --- | --- |
| `CORRECTION_LLM_BACKEND` | `ollama`, `anthropic`, or `none` |
| `OLLAMA_BASE_URL` | Default `http://127.0.0.1:11434` |
| `OLLAMA_MODEL` | Tag pulled in Ollama (e.g. `gemma3:12b`) |
| `ANTHROPIC_API_KEY` | If using Claude for correction |
| `PROCESS_INLINE` | Process jobs in the API process (no Celery) |
| `REDIS_URL` | Celery broker / result backend |
| `PADDLE_USE_DOC_ORIENTATION` / `PADDLE_USE_DOC_UNWARPING` | Trade accuracy vs. speed for Paddle preprocessing |

See `.env.example` for the full list.
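For a simple fully local setup (no Redis/Celery, Ollama correction), a `.env` might look like the fragment below; the values are illustrative, and `.env.example` remains the authoritative list of variables:

```shell
# Fully local: inline processing, correction via a local Ollama model.
CORRECTION_LLM_BACKEND=ollama
OLLAMA_BASE_URL=http://127.0.0.1:11434
OLLAMA_MODEL=gemma3:12b
PROCESS_INLINE=true
```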


## API sketch

| Method | Path | Description |
| --- | --- | --- |
| GET | `/health` | Liveness |
| POST | `/api/documents/upload` | Multipart file upload |
| GET | `/api/documents/{job_id}/status` | Progress / stage |
| GET | `/api/documents/{job_id}/result` | OCR + NLP JSON |
| GET | `/api/documents/{job_id}/text` | Plain text (completed jobs) |
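A minimal polling client for these endpoints could look like the sketch below. Note the response field names and status values (`"status"`, `"completed"`, `"failed"`) are assumptions for illustration; check the actual response schema before relying on them:

```python
import json
import time
import urllib.request

API_BASE = "http://localhost:8000"  # default uvicorn address from the quick start

# Assumed terminal status values; not confirmed by the repo.
TERMINAL_STATES = {"completed", "failed"}

def is_terminal(status_payload: dict) -> bool:
    """Return True when a status payload reports a finished job."""
    return status_payload.get("status") in TERMINAL_STATES

def wait_for_result(job_id: str, poll_seconds: float = 2.0) -> dict:
    """Poll the status endpoint until the job finishes, then fetch the full result."""
    while True:
        url = f"{API_BASE}/api/documents/{job_id}/status"
        with urllib.request.urlopen(url) as resp:
            status = json.loads(resp.read())
        if is_terminal(status):
            break
        time.sleep(poll_seconds)
    with urllib.request.urlopen(f"{API_BASE}/api/documents/{job_id}/result") as resp:
        return json.loads(resp.read())
```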

## Development

```shell
pytest
ruff check backend tests
```

Pre-download heavy models (optional):

```shell
python -m scripts.download_models
```

## Contributing

Issues and pull requests are welcome. Please:

1. Open an issue for larger changes or design questions.
2. Keep PRs focused; match the existing style and run `pytest` / `ruff` when you touch Python.

## License

This project is released under the MIT License.


## Acknowledgments

Built with FastAPI, PaddleOCR, EasyOCR, Tesseract, and the broader open-source ML ecosystem.
