nextract is a small, pragmatic framework for structured data extraction from files using the Pydantic AI Agent. It focuses on clean boundaries, strong typing, and JSON Schema/Pydantic-driven outputs—while keeping file handling simple and predictable.
Scope of this build
- Uses the Pydantic AI Agent only.
- Takes local file paths and feeds content to the Agent:
  - Text files are read as text and wrapped in delimiters.
  - PDFs and images are attached as binary bytes.
  - Office docs (`.doc`/`.docx`/`.ppt`/`.pptx`) are converted to PDF first when a converter is available.
  - Excel files: `.xlsx` is extracted to TSV (in-process); `.xls` attempts CSV via LibreOffice/unoconv.
- OCR support for scanned PDFs using Tesseract (requires system installation of the Tesseract binary).
- Automatic chunking for large documents with sentence-aware splitting and intelligent merging.
- Returns a `dict` by default, or a Pydantic model instance if you pass a model and request it.
- Tracing via structlog; usage & cost estimation from Agent usage plus a simple model pricing table.
- Structured extraction for small files with:
  - JSON Schema (output as `dict[str, Any]`), or
  - Pydantic v2 models (output as dict by default; optional model instance).
- Pydantic AI Agent integration:
  - Raw binary attachments for PDFs/images.
  - `StructuredDict` for JSON Schema outputs.
  - Usage metrics retrieved from the run.
- Batch mode runs one file per Agent call in parallel.
- Cost estimation via a simple pricing map (optional).
- Structlog logging to console.
- ZIP files: extracted to `/tmp`, with each contained file processed "as-is".
Supported file types
- Read Text Directly:
.txt, .md, .csv, .tsv, .xls, .xlsx, .json, .xml, .yaml, .yml, .html, .htm
-
Upload Directly (binary): Images (
.png,.jpg,.jpeg,.webp,.gif,.bmp,.tiff), PDF (.pdf) -
ZIP: Extracted to
/tmp/nextract-zip-<name>and each file inside is processed “as-is”. -
Accepted as binary, converted to PDF before uploading to LLMs
.doc,.docx,.ppt,.pptx -
Audio:
.mp3,.wav,.m4a,.ogg,.flac,.aac,.wma→ Attached as binary bytes with their nativeaudio/*MIME type for models with audio input support. -
Video:
.mp4,.webm,.mov,.avi,.mkv,.wmv→ Attached as binary bytes with their nativevideo/*MIME type for models with video input support.
```shell
# Install from PyPI
pip install nextract

# Or install from source for development
git clone https://github.com/your-username/nextract.git
cd nextract
pip install -e .[dev]
```

Python: 3.10+
For OCR support (scanned PDFs), you need to install the Tesseract OCR binary:

- macOS (Homebrew): `brew install tesseract`
- Ubuntu/Debian: `sudo apt-get update && sudo apt-get install -y tesseract-ocr`
- Fedora/CentOS/RHEL: `sudo dnf install -y tesseract`
- Windows:
  - Download the installer from the GitHub Tesseract releases
  - Add Tesseract to your system PATH

Note: the Python packages (`pytesseract`, `pdf2image`, `pillow`) are installed automatically with nextract. Only the Tesseract binary needs manual installation.
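Before relying on OCR, you can check that the binary is actually reachable. A small standard-library sketch (the `tesseract_available` helper is illustrative, not part of nextract):

```python
import shutil

def tesseract_available() -> bool:
    """Return True if the Tesseract binary is on the system PATH."""
    return shutil.which("tesseract") is not None

if not tesseract_available():
    print("Tesseract not found: scanned-PDF OCR will not work.")
```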
```python
from nextract import extract

schema = {
    "title": "Invoice",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"}
    },
    "required": ["invoice_number", "total"]
}

res = extract(
    files=["./docs/invoice.pdf"],
    schema_or_model=schema,
    user_prompt="Extract the invoice fields exactly as defined.",
    include_extra=True,  # adds a top-level `extra` bag for helpful unmodeled fields
)

print(res["data"])    # dict[str, Any] matching your schema (+ optional `extra`)
print(res["report"])  # model, usage, cost_estimate_usd, warnings
```

```python
from pydantic import BaseModel
from nextract import extract

class Invoice(BaseModel):
    invoice_number: str
    date: str | None = None
    total: float

res = extract(
    files=["./docs/invoice.pdf"],
    schema_or_model=Invoice,
    user_prompt="Extract the invoice fields."
    # include_extra is ignored for Pydantic model mode
)

# Default behavior returns a dict
print(res["data"])  # -> {'invoice_number': '...', 'date': '...', 'total': ...}
```

To get the Pydantic model instance instead of a dict:

```python
res = extract(
    files=["./docs/invoice.pdf"],
    schema_or_model=Invoice,
    user_prompt="Extract the invoice fields.",
    return_pydantic=True,
)

invoice_obj = res["data"]  # -> Invoice instance
```

Process each file independently (one Agent call per file):
```python
from nextract import batch_extract

schema = {
    "title": "DocSummary",
    "type": "object",
    "properties": {"title": {"type": "string"}, "summary": {"type": "string"}},
    "required": ["title"]
}

res = batch_extract(
    batch=["./a.pdf", "./b.png", "./c.txt"],  # or [["./a1.pdf", "./a2.pdf"], ["./b1.pdf"]] to group
    schema_or_model=schema,
    user_prompt="Summarize each document with title + summary.",
    include_extra=False,
    max_concurrency=4,
)

# result is a dict keyed by the first file in each item
print(res.keys())  # -> {"./a.pdf", "./b.png", "./c.txt"}
```

`nextract extract` processes all files together in a single AI agent run:
- Use when you want to extract information from multiple files as a cohesive unit
- Returns one structured data object for all files combined
- Example: Extract information from a contract and its amendments together
`nextract batch` processes each file independently in parallel:
- Use when you want to extract structured data from each file individually
- Returns one result per file, keyed by filename
- Faster for multiple files due to parallel processing (configurable concurrency)
- Example: Extract invoice data from 100 separate invoice PDFs
```shell
# JSON Schema - single extraction run
nextract extract ./invoice.pdf ./amendment.pdf \
  --schema ./invoice.schema.json \
  --prompt "Extract the invoice fields." \
  --include-extra

# Pydantic model (module:Class or module.Class) - single extraction run
nextract extract ./invoice.pdf \
  --pydantic-model mypkg.models:Invoice

# Batch (parallel) - one extraction run per file
nextract batch ./a.pdf ./b.png ./c.txt \
  --schema ./summary.schema.json \
  --prompt "Summarize each document." \
  --max-concurrency 4
```

Run `nextract --help`, `nextract extract --help`, or `nextract batch --help` for more.
| Variable | Default | Description |
|---|---|---|
| `NEXTRACT_MODEL` | `openai:gpt-4o` | Pydantic AI model string (`provider:model-id`). |
| `NEXTRACT_MAX_CONCURRENCY` | `4` | Max parallel Agent calls in `batch_extract`. |
| `NEXTRACT_MAX_RUN_RETRIES` | `5` | Max retry attempts around Agent runs. |
| `NEXTRACT_PER_CALL_TIMEOUT_SECS` | `120` | Per-call timeout in seconds. |
| `NEXTRACT_PRICING` | (unset) | JSON map for cost estimation (see below). |
| `NEXTRACT_MAX_VALIDATION_ROUNDS` | `2` | Max schema-enforced output validation retries. |
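These settings can be read with ordinary environment-variable handling. An illustrative sketch applying the documented defaults (the `env_int` helper is hypothetical, not nextract API):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to the default."""
    raw = os.environ.get(name)
    return int(raw) if raw else default

model = os.environ.get("NEXTRACT_MODEL", "openai:gpt-4o")
max_concurrency = env_int("NEXTRACT_MAX_CONCURRENCY", 4)
timeout_secs = env_int("NEXTRACT_PER_CALL_TIMEOUT_SECS", 120)
print(model, max_concurrency, timeout_secs)
```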
Also set provider credentials as expected by Pydantic AI for your chosen provider. For example, for OpenAI: `OPENAI_API_KEY=...`

`NEXTRACT_PRICING` expects a JSON string like:

```json
{
  "openai:gpt-4o": { "input_per_1k": 0.005, "output_per_1k": 0.015 },
  "openai:gpt-4.1-mini": { "input_per_1k": 0.003, "output_per_1k": 0.006 }
}
```

This is used to compute `cost_estimate_usd` from the Agent's token usage. If the current model is missing from this map, the cost will be `null`.
By default, nextract uses `openai:gpt-4o` (vision-capable). Choose a model by:

- Environment variable: `export NEXTRACT_MODEL="provider:model-id"`
- Python argument override (takes precedence over env):

  ```python
  from nextract import extract, batch_extract

  extract(["./invoice.pdf"], schema_or_model=my_schema, model="openai:gpt-4o")
  batch_extract([["a.pdf"], ["b.png"]], schema_or_model=my_schema, model="anthropic:claude-3-7-sonnet")
  ```

- CLI flag (takes precedence over env):

  ```shell
  nextract extract ./invoice.pdf --schema schema.json --model openai:gpt-4o
  nextract batch ./a.pdf ./b.png --schema schema.json --model anthropic:claude-3-7-sonnet
  ```

You can still construct and pass a `RuntimeConfig` if you need to tune concurrency, retries, or timeouts.
- You pass file paths (single or multiple).
- nextract prepares content:
  - Textual files are read and injected as-is into the prompt, wrapped by:

    ```
    --- BEGIN FILE: <name> (mime) ---
    <file contents>
    --- END FILE: <name> ---
    ```

  - Binary files (PDFs, images, Office docs, others) are attached as binary parts using Pydantic AI's `BinaryContent`.
- An Agent is created with:
  - a system prompt that instructs strict, schema-aligned extraction,
  - an `output_type` of either:
    - `StructuredDict` (JSON Schema) → outputs a dict, or
    - your Pydantic model → outputs a model instance (dumped to dict by default).
- For JSON Schema mode, a jsonschema validator runs as an output validator. On failure, the Agent is asked to retry briefly (limited rounds).
- The result is validated again before returning. You get `data` (a dict by default) and `report` with usage and an optional cost estimate.
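The BEGIN/END wrapping for textual files can be sketched as follows (format copied from the listing above; `wrap_text_file` is a hypothetical helper, not nextract's actual function):

```python
def wrap_text_file(name: str, mime: str, contents: str) -> str:
    """Wrap a textual file in the BEGIN/END delimiters shown above."""
    return (
        f"--- BEGIN FILE: {name} ({mime}) ---\n"
        f"{contents}\n"
        f"--- END FILE: {name} ---"
    )

print(wrap_text_file("notes.txt", "text/plain", "hello world"))
```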
- Text: `.txt`, `.md`, `.json`, `.yaml`, `.yml`, `.xml`, `.csv`, `.tsv`, `.html`, `.htm` → read as text (UTF-8 with fallback), injected verbatim with file delimiters.
- Excel: `.xlsx` (parsed to TSV via in-process XML), `.xls` (CSV via CLI if available; else raw fallback) → read as text and injected like other textual files. Best-effort extraction (no styling/formatting).
- PDF / Images: `.pdf`, `.png`, `.jpg`, `.jpeg`, `.webp`, `.gif`, `.bmp`, `.tiff` → attached as binary bytes for the model (vision-capable models recommended).
- Audio: `.mp3`, `.wav`, `.m4a`, `.ogg`, `.flac`, `.aac`, `.wma` → attached as binary bytes with native `audio/*` MIME type.
- Video: `.mp4`, `.webm`, `.mov`, `.avi`, `.mkv`, `.wmv` → attached as binary bytes with native `video/*` MIME type.
- Office docs: `.doc`, `.docx`, `.ppt`, `.pptx` → converted to PDF via LibreOffice/soffice or unoconv if available; on failure, attached as the original binary.
- ZIP: extracted to `/tmp/nextract-zip-<name>`; each inner file is processed as above. No nested recursion.
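The rules above amount to a dispatch on file extension. An illustrative routing table (the real dispatch lives inside nextract and may differ in detail):

```python
from pathlib import Path

# Extension sets taken from the handling rules listed above.
TEXT = {".txt", ".md", ".json", ".yaml", ".yml", ".xml", ".csv", ".tsv",
        ".html", ".htm", ".xlsx", ".xls"}
CONVERT = {".doc", ".docx", ".ppt", ".pptx"}

def handling(path: str) -> str:
    ext = Path(path).suffix.lower()
    if ext == ".zip":
        return "extract-zip"
    if ext in CONVERT:
        return "convert-to-pdf"
    if ext in TEXT:
        return "read-as-text"
    return "attach-binary"  # PDFs, images, audio, video

print(handling("report.DOCX"), handling("scan.pdf"), handling("data.xlsx"))
# convert-to-pdf attach-binary read-as-text
```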
OCR support: scanned PDFs are automatically detected and processed using Tesseract OCR. This requires the Tesseract binary to be installed (see Installation).
nextract attempts to convert `.doc`/`.docx`/`.ppt`/`.pptx` to PDF using system tools. These are external dependencies and are not installed via pip.
- Preferred: `soffice` (LibreOffice) in headless mode.
- Fallback: `unoconv` (uses LibreOffice UNO).

Installation hints:

- macOS (Homebrew): `brew install --cask libreoffice`
  - Ensure `soffice` is on your `PATH`. If not, you can symlink: `ln -s "/Applications/LibreOffice.app/Contents/MacOS/soffice" /usr/local/bin/soffice` (adjust for the Apple Silicon/Homebrew prefix)
- Ubuntu/Debian: `sudo apt-get update && sudo apt-get install -y libreoffice`
  - Optional: `sudo apt-get install -y unoconv`
- Fedora/CentOS/RHEL: `sudo dnf install -y libreoffice` (or `yum` on older systems)
  - Optional: install `unoconv` from your distro repos if available.
- Windows: install LibreOffice from https://www.libreoffice.org/download/ and add `soffice.exe` to your `PATH`.

If neither tool is found, nextract logs a warning and falls back to attaching the original Office binary.
You can supply examples to guide the model.

Programmatic (`examples` argument):

- Output-only examples: `list[dict]`
- Paired input/output: `list[tuple[str | None, dict]]`

CLI (`--examples` JSON file):

- Output-only examples:

  ```json
  [ { "invoice_number": "INV-001", "total": 123.45 } ]
  ```

- Paired input/output (use a two-element array):

  ```json
  [ ["Item: Widget A, Total: 123.45", { "invoice_number": "INV-001", "total": 123.45 }] ]
  ```

"Extra" fields (JSON Schema mode): if you pass `include_extra=True`, your schema is augmented with a top-level

```json
"extra": { "type": "object", "additionalProperties": true }
```

so the model can place relevant-but-unspecified fields there.
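The augmentation amounts to adding one property to the schema's `properties` object. A sketch of the documented behavior (`with_extra_bag` is an illustrative helper, not nextract's internal function):

```python
def with_extra_bag(schema: dict) -> dict:
    """Return a copy of the schema with the documented top-level `extra` bag."""
    out = dict(schema)
    out["properties"] = dict(schema.get("properties", {}))
    out["properties"]["extra"] = {"type": "object", "additionalProperties": True}
    return out

invoice_schema = {"type": "object", "properties": {"total": {"type": "number"}}}
print(sorted(with_extra_bag(invoice_schema)["properties"]))  # ['extra', 'total']
```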
All entry points return a dict with this structure:

```json
{
  "data": { /* your structured result (dict by default) */ },
  "report": {
    "model": "provider:model-id",
    "files": ["..."],
    "usage": {
      "requests": 1,
      "tool_calls": 0,
      "input_tokens": 123,
      "output_tokens": 456,
      "details": { /* provider-dependent */ }
    },
    "cost_estimate_usd": 0.0123,
    "warnings": []
  }
}
```

In Pydantic model mode, `data` is still a dict unless you passed `return_pydantic=True`, in which case it is the model instance.
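Consuming the result is plain dict access; a short sketch against a mocked `res` with the shape documented above:

```python
# Mocked result in the documented structure (no Agent call is made here).
res = {
    "data": {"invoice_number": "INV-001", "total": 123.45},
    "report": {
        "model": "openai:gpt-4o",
        "usage": {"requests": 1, "input_tokens": 123, "output_tokens": 456},
        "cost_estimate_usd": 0.0123,
        "warnings": [],
    },
}

usage = res["report"]["usage"]
summary = (
    f"{usage['input_tokens']} in / {usage['output_tokens']} out, "
    f"cost ~${res['report']['cost_estimate_usd']:.4f}"
)
print(summary)  # 123 in / 456 out, cost ~$0.0123
```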
- Uses structlog; logs are JSON-formatted to stdout.
- Each extraction logs: model, files, usage, warnings, and cost estimate.
- You can set up your own logging before calling `extract`/`batch_extract`. By default, the library initializes logging for you (toggle via `setup_logs=False`).
- Each Agent call is wrapped with exponential backoff (max attempts from `NEXTRACT_MAX_RUN_RETRIES`).
- The timeout per call is `NEXTRACT_PER_CALL_TIMEOUT_SECS` (default 120s).
- In batch mode, up to `NEXTRACT_MAX_CONCURRENCY` tasks run in parallel (default 4).
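The retry behavior can be pictured as a generic exponential-backoff wrapper; this is a sketch of the pattern, and the real implementation in nextract may differ:

```python
import random
import time

def with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry fn with exponential backoff plus a little jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```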
Planned design (not implemented in this build):
- Chunking (semantic/page) for large inputs.
- Per-chunk extraction with all fields optional, then merge into a full model validated against the target schema/model.
- Pluggable conflict resolution & optional provenance.
- No readability parsing for HTML.
- OCR requires the Tesseract binary to be installed separately (the Python packages are included).
- Office conversions require `soffice` (LibreOffice) or `unoconv` installed; otherwise we fall back to attaching the original binary.
- Office file understanding depends on the model/provider.
- Very large inputs may exceed model or provider limits.
- ZIP extraction writes to `/tmp/nextract-zip-<name>`; these temp files are not auto-deleted by the library.
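Since the extracted ZIP contents are not auto-deleted, a caller can clean them up explicitly. A sketch using the `/tmp/nextract-zip-<name>` pattern from the docs (`cleanup_zip_temp` is a hypothetical helper, not part of nextract):

```python
import shutil
from pathlib import Path

def cleanup_zip_temp(tmp_root: str = "/tmp") -> int:
    """Remove leftover nextract-zip-* directories; returns the count removed."""
    removed = 0
    for entry in Path(tmp_root).glob("nextract-zip-*"):
        if entry.is_dir():
            shutil.rmtree(entry, ignore_errors=True)
            removed += 1
    return removed
```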
Q: Which providers/models can I use?
A: Any provider supported by the Pydantic AI Agent. Select it via `NEXTRACT_MODEL="provider:model-id"` and set the provider's expected credentials (e.g., `OPENAI_API_KEY`).

Q: What happens if schema validation keeps failing?
A: The Agent is asked to retry a couple of times. Final results are validated once more; if still invalid, you'll see a `final_validation_error` entry under `report.warnings`.

Q: Can I store or inspect the attachments that nextract sends?
A: This build sends raw text or binary bytes directly to the Agent. If you need durable storage or redaction, wrap nextract in your own pipeline.

Q: Can I get a Pydantic model out?
A: Yes: pass your model class to `schema_or_model` and set `return_pydantic=True`.
```shell
# Install development dependencies
pip install -e .[dev]

# Run tests
pytest

# Run linting
ruff check nextract

# Build package
python -m build

# Test installation
pip install dist/*.whl
```

This project uses GitHub Actions for continuous integration and automated PyPI publishing:

- CI: runs on every push/PR, testing across Python 3.10-3.12
- Release: automatically publishes to PyPI when GitHub releases are created
- Versioning: managed statically in `pyproject.toml` (current: `0.0.1`)

To release:

- Bump `version` in `pyproject.toml`, commit, and push.
- Create a GitHub release (with notes); this triggers automatic PyPI publishing.
MIT. Feel free to adapt and extend.
```
nextract/
├─ nextract/
│  ├─ __init__.py        # exports extract, batch_extract
│  ├─ version.py
│  ├─ config.py          # RuntimeConfig (model, concurrency, timeouts, pricing)
│  ├─ logging.py         # structlog setup
│  ├─ mimetypes_map.py   # simple mapping & helpers
│  ├─ schema.py          # JSON Schema/Pydantic utilities
│  ├─ prompts.py         # system prompt + examples builder
│  ├─ files.py           # read-as-is; BinaryContent or text
│  ├─ pricing.py         # usage → cost estimate
│  ├─ agent_runner.py    # Agent wiring, retries, validation, metrics
│  ├─ core.py            # public API: extract, batch_extract
│  └─ cli.py             # Typer CLI
└─ pyproject.toml
```