Snail Mail Parser is a Python-based document processing system for converting scanned paper mail into structured digital data. It performs OCR, detects QR codes, classifies content using an LLM, and outputs structured results in YAML and Markdown formats.
- Watches a specified directory for new scanned documents
- Groups related files into document sessions based on filename patterns
- Extracts and classifies text and metadata
- Outputs structured data for further processing or archival
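
The filename-based grouping step can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the naming scheme `<session>_p<page>.<ext>` and the helper `group_into_sessions` are assumptions made for the example.

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical naming scheme: "<session>_p<page>.<ext>", e.g. "scan-0412_p2.png".
PAGE_PATTERN = re.compile(r"^(?P<session>.+)_p(?P<page>\d+)$")


def group_into_sessions(paths):
    """Group scanned files by session id and order each group by page number."""
    sessions = defaultdict(list)
    for path in map(Path, paths):
        match = PAGE_PATTERN.match(path.stem)
        if match:
            key, page = match["session"], int(match["page"])
        else:
            # Files that do not follow the pattern become single-page sessions.
            key, page = path.stem, 1
        sessions[key].append((page, path))
    # Sort on the page number only, then drop it from the result.
    return {
        key: [p for _, p in sorted(pages, key=lambda item: item[0])]
        for key, pages in sessions.items()
    }
```

Calling `group_into_sessions(["a_p2.png", "a_p1.png", "b.pdf"])` yields two sessions, `a` (two pages in page order) and `b` (one page).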
- File Monitoring: Watches a directory using `watchdog`
- Format Support: `*.png`, `*.jpg`, `*.jpeg`, `*.tif`, `*.tiff`, `*.pdf`
- Multi-page Grouping: LLM judgement based.
- OCR Engine: `pytesseract` + `pdf2image` or `PyMuPDF`
- QR Detection: `pyzbar`
- LLM Analysis: Calls GPT-4o via OpenAI API to classify and extract fields:
  - `sender`
  - `date`
  - `subject`
  - `document_type`
  - `content_summary`
  - `payment_info` (IBAN, amount, due date)
  - `suspected_qr`: boolean flag
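
One way to validate the fields returned by the LLM is a `pydantic` model. The sketch below assumes the model replies with a JSON object keyed by the field names listed above; the class names, the field types, and the helper `parse_llm_response` are illustrative assumptions, not the project's actual schema.

```python
import datetime as dt
import json
from typing import Optional

from pydantic import BaseModel


class PaymentInfo(BaseModel):
    # All payment details are optional: most letters carry none.
    iban: Optional[str] = None
    amount: Optional[float] = None
    due_date: Optional[dt.date] = None


class DocumentFields(BaseModel):
    sender: Optional[str] = None
    # "dt.date" avoids a clash between the field name and the type name.
    date: Optional[dt.date] = None
    subject: Optional[str] = None
    document_type: Optional[str] = None
    content_summary: Optional[str] = None
    payment_info: Optional[PaymentInfo] = None
    suspected_qr: bool = False


def parse_llm_response(raw_json: str) -> DocumentFields:
    """Parse and validate a JSON reply (assumed format) into typed fields."""
    return DocumentFields(**json.loads(raw_json))
```

Validation fails loudly with a `ValidationError` if the reply contains, for example, a date that cannot be parsed, which makes malformed LLM output easier to catch before it is written to disk.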
- YAML: Machine-readable structured output
- Markdown: Human-readable summary document
- Thumbnails: Optional preview images of pages
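
As a rough illustration of the Markdown output, the function below renders a summary from a dictionary of extracted fields. The layout (title from `subject`, a bullet per metadata field, then the summary) is an assumption; the project's real template may differ.

```python
def render_markdown(fields: dict) -> str:
    """Render extracted fields as a short Markdown summary (assumed layout)."""
    lines = [f"# {fields.get('subject') or 'Untitled document'}", ""]
    for key in ("sender", "date", "document_type"):
        if fields.get(key):
            label = key.replace("_", " ").title()
            lines.append(f"- **{label}**: {fields[key]}")
    if fields.get("content_summary"):
        # Separate the metadata list from the free-text summary.
        lines += ["", fields["content_summary"]]
    return "\n".join(lines) + "\n"
```

Missing or empty fields are skipped rather than printed as blanks, so partially classified documents still produce a readable summary.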
- Language: Python 3.11+
- Libraries: `watchdog`, `pytesseract`, `pyzbar`, `pdf2image`, `PyMuPDF`, `openai`, `pydantic`, `ruamel.yaml`, `markdown-it-py`
- API: Optional FastAPI service for document retrieval
- Config: `.env` or `settings.py` via `pydantic-settings`
- Logs all processing steps with timestamps
- Tracks file paths, OCR output, LLM prompts/responses
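
Timestamped step logging can be done with the standard `logging` module alone. The helper names `build_logger` and `log_step` and the `key=value` message layout are assumptions for this sketch, not the project's actual logging setup.

```python
import logging


def build_logger(name="snailmail", stream=None):
    """Create a logger whose records start with a timestamp."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.handlers = [handler]  # avoid duplicate handlers on repeated calls
    logger.propagate = False
    return logger


def log_step(logger, step, file_path, **details):
    """Log one processing step together with the file it applies to."""
    extras = " ".join(f"{key}={value!r}" for key, value in details.items())
    logger.info("step=%s file=%s %s", step, file_path, extras)
```

For example, `log_step(build_logger(), "ocr", "inbox/a_p1.png", chars=120)` emits one line containing the timestamp, the step name, the file path, and the extra detail.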
- Optionally run behind FastAPI with auth middleware
- Can be isolated from internet (except OpenAI endpoint)
- Clone the repo and install dependencies: `pip install -r requirements.txt`