document-extraction

Here are 17 public repositories matching this topic...

yigitkonur / llm-based-ocr

High-accuracy PDF-to-Markdown OCR API using LLMs with vision capabilities. Features parallel processing, batching, and auto-retry logic for scalable extraction.

table-extraction pymupdf document-extraction azure-openai intelligent-document-processing gpt4-vision rag-pipeline vision-ocr complex-layout-analysis batch-ocr text-digitization

Updated Nov 29, 2025
Python

cernis-intelligence / docuglean-ocr

Intelligent document processing. Extract structured data like JSON, Markdown and HTML from documents using AI.

pdf parser ocr pdf-converter developer-tools document-classification document-extraction llms

Updated Nov 17, 2025
Python

Xyntopia / pydoxtools

Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.

python nlp pdf information-retrieval extraction document-analysis document-extraction llm chatgpt

Updated Sep 5, 2024
Python

FantDing / Image-document-extract-and-correction

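Final project for a digital image processing course: extracts and rectifies documents in photographs. The overall approach is to detect straight lines with the Hough transform, derive the corner points from those lines, and then rectify the document with a projective transformation. The project uses OpenCV only for I/O operations; convolution, the Hough transform, the projective transformation, and so on are all implemented by hand.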

affine-transformation hough-lines document-extraction

Updated Aug 7, 2020
Python
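
The corner-finding step in the Image-document-extract-and-correction description (Hough lines, then corner points, then a projective transformation) can be sketched roughly as follows. This is a minimal illustration, not the repository's code: it assumes each detected line is given in Hough normal form (rho, theta), and the function names `intersect_hough_lines` and `document_corners` are hypothetical.

```python
import math


def intersect_hough_lines(line_a, line_b, eps=1e-9):
    """Intersect two lines given in Hough normal form (rho, theta).

    Each line satisfies x*cos(theta) + y*sin(theta) = rho.
    Returns (x, y), or None when the lines are (nearly) parallel.
    """
    rho_a, theta_a = line_a
    rho_b, theta_b = line_b
    # Determinant of [[cos a, sin a], [cos b, sin b]].
    det = (math.cos(theta_a) * math.sin(theta_b)
           - math.sin(theta_a) * math.cos(theta_b))
    if abs(det) < eps:
        return None
    # Solve the 2x2 linear system with Cramer's rule.
    x = (rho_a * math.sin(theta_b) - rho_b * math.sin(theta_a)) / det
    y = (rho_b * math.cos(theta_a) - rho_a * math.cos(theta_b)) / det
    return x, y


def document_corners(left, top, right, bottom):
    """Return corners as top-left, top-right, bottom-right, bottom-left."""
    return [
        intersect_hough_lines(top, left),
        intersect_hough_lines(top, right),
        intersect_hough_lines(bottom, right),
        intersect_hough_lines(bottom, left),
    ]


if __name__ == "__main__":
    # Assumed example: an axis-aligned page bounded by x=10, x=110, y=20, y=220.
    corners = document_corners(
        left=(10.0, 0.0),
        top=(20.0, math.pi / 2),
        right=(110.0, 0.0),
        bottom=(220.0, math.pi / 2),
    )
    print(corners)
```

The four corners obtained this way would then serve as the source points for the projective transformation that maps the document onto an upright rectangle.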

alephdata / ingest-file

Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.

ocr excel forensics documents metadata-extraction document-extraction forensics-investigations email-forensics

Updated Dec 5, 2025
Python

ryanmcdonough / lexplore

Tool to allow extraction of data from legal documents

document-extraction legal-tech generative-ai

Updated Aug 1, 2024
Python

openaleph / ingest-file

Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.

pdf ocr email excel forensics documents metadata-extraction document-extraction file-ingestion email-forensics followthemoney openaleph

Updated Dec 2, 2025
Python

sensible-hq / tutorial-pdf-to-excel

Converts a PDF file to Excel.

python pdf excel extraction document-extraction

Updated Sep 1, 2023
Python

AmmarAhm3d / invoice-gemini-extracter

Invoice-Gemini-Extracter: Python tool to extract structured invoice data (fields and line items) from PDFs/images using OCR, preprocessing, and Google Gemini-powered extraction/normalization.

nlp automation ocr computer-vision pipeline information-extraction pdf-parsing table-extraction multimodal document-extraction pydantic tessaract-ocr invoice-processing google-gemini

Updated Mar 31, 2025
Python

subratamondal1 / document-extraction

Document extraction from PDFs and images with OpenCV.

opencv computer-vision image-processing python3 pytorch py document-extraction

Updated Aug 20, 2024
Python

EloiRamos / dolphin-doc-extractor

AI-powered document intelligence platform that extracts structured data from PDFs, Word docs, and images using Large Language Models and Tesseract OCR.

nlp ocr tesseract-ocr gradio document-extraction

Updated Oct 22, 2025
Python

PMTheTechGuy / document-entity-extractor

AI-powered document extractor for names, emails, and organizations. License: MIT

python automation ai web-app pandas openai data-extraction gpt portfolio-project entity-recognition document-extraction uvicorn fastapi

Updated May 18, 2025
Python

mithgx / UnstructData

UnstructData is a Python toolkit for extracting, transforming, and analyzing unstructured data from diverse sources like text files, logs, and documents. Key features include flexible preprocessing, data cleaning, feature extraction, and extensible utilities, ideal for streamlining messy data workflows.

etl transformers document-extraction pdf-document-processor

Updated Jun 25, 2025
Python

hreikin / pdf-toolbox

Extract content from PDFs and convert or create new documents from the content in multiple output formats.

python document-conversion pandoc python3 text-extraction adobe scrapy pypandoc pymupdf document-converter document-creator document-extraction document-creation image-extraction

Updated Mar 17, 2022
Python

JunoLeong / RAG-DocExtractRAG

DocExtractRAG is a Retrieval-Augmented Generation (RAG) system that combines large language models (LLMs) with document retrieval to answer questions based on academic or other types of documents. The system uses the Zephyr-7B-beta model for text generation and BAAI/bge-large-en for document embeddings.

zephyr document-extraction rag huggingface chromadb

Updated Dec 22, 2024
Python

Ritesh1137 / langchain-doc-intelligence-loader

Customized LangChain Azure Document Intelligence loader for table extraction and summarization

table-extraction document-extraction document-layout-analysis azure-ai ai-engineering openai-api document-processing-pipeline generative-ai langchain langchain-python retrieval-augmentation-generation azure-ai-services

Updated Apr 30, 2024
Python

rajsinghparihar / data-detective

An app that leverages LLMs to process documents, extract relevant information and provide a summary specific to financial data

ocr information-extraction document-extraction rag llms

Updated Jan 20, 2025
Python
