High-accuracy PDF-to-Markdown OCR API using LLMs with vision capabilities. Features parallel processing, batching, and auto-retry logic for scalable extraction.
Updated Nov 29, 2025 - Python
Intelligent document processing. Extract structured data like JSON, Markdown and HTML from documents using AI.
Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.
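Final project for a digital image processing course: extracts and rectifies documents in images. The overall approach detects straight lines with the Hough transform, derives the corner points from those lines, and then applies a projective transform to rectify the document. The project uses OpenCV only for I/O; the convolution, Hough transform, projective transform, and so on are implemented by hand.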
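The step from detected lines to corner points in the description above can be sketched as follows. This is a minimal illustration, not code from that repository: it assumes lines are given in the usual Hough normal form (rho, theta), where x·cos(theta) + y·sin(theta) = rho, and the function name `hough_line_intersection` is hypothetical.

```python
import math


def hough_line_intersection(line_a, line_b, eps=1e-9):
    """Intersect two lines given in Hough normal form (rho, theta).

    Each line satisfies x*cos(theta) + y*sin(theta) = rho.
    Returns (x, y), or None when the lines are (nearly) parallel.
    Hypothetical helper for illustration only.
    """
    rho1, theta1 = line_a
    rho2, theta2 = line_b
    a1, b1 = math.cos(theta1), math.sin(theta1)
    a2, b2 = math.cos(theta2), math.sin(theta2)

    # Solve the 2x2 linear system with Cramer's rule.
    det = a1 * b2 - a2 * b1
    if abs(det) < eps:
        return None  # parallel lines have no single intersection point
    x = (rho1 * b2 - rho2 * b1) / det
    y = (a1 * rho2 - a2 * rho1) / det
    return x, y


# Example: the vertical line x = 3 and the horizontal line y = 4.
corner = hough_line_intersection((3.0, 0.0), (4.0, math.pi / 2))
```

Intersecting each pair of adjacent page edges this way would yield the four corners that a projective transform needs as input.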
Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.
Tool to allow extraction of data from legal documents
Converts a PDF file to Excel.
Invoice-Gemini-Extracter: Python tool to extract structured invoice data (fields and line items) from PDFs/images using OCR, preprocessing, and Google Gemini-powered extraction/normalization.
Document extraction from PDFs and images with OpenCV.
AI-powered document intelligence platform that extracts structured data from PDFs, Word docs, and images using Large Language Models and Tesseract OCR.
AI-powered document extractor for names, emails, and organizations. License: MIT
UnstructData is a Python toolkit for extracting, transforming, and analyzing unstructured data from diverse sources like text files, logs, and documents. Key features include flexible preprocessing, data cleaning, feature extraction, and extensible utilities—ideal for streamlining messy data workflows.
Extract content from PDFs and convert it, or create new documents from it, in multiple output formats.
DocExtractRAG is a Retrieval-Augmented Generation (RAG) system that combines large language models (LLMs) with document retrieval to answer questions about academic or other types of documents. The system uses the Zephyr-7B-beta model for text generation and BAAI/bge-large-en for document embeddings.
Customized LangChain Azure Document Intelligence loader for table extraction and summarization
An app that leverages LLMs to process documents, extract relevant information and provide a summary specific to financial data