
nextract

nextract is a small, pragmatic framework for structured data extraction from files using the Pydantic AI Agent. It focuses on clean boundaries, strong typing, and JSON Schema/Pydantic-driven outputs—while keeping file handling simple and predictable.

Scope of this build

  • Uses Pydantic AI Agent only.

  • Takes local file paths and feeds content to the Agent:

    • Text files are read as text and wrapped in delimiters.
    • PDFs and images are attached as binary bytes.
    • Office docs (.doc/.docx/.ppt/.pptx) are converted to PDF first when a converter is available.
    • Excel files: .xlsx is extracted to TSV (in-process); .xls attempts CSV via LibreOffice/unoconv.
  • OCR support for scanned PDFs using Tesseract (requires system installation of Tesseract binary).

  • Automatic chunking for large documents with sentence-aware splitting and intelligent merging.

  • Returns a dict by default, or a Pydantic model instance if you pass a model and request it.

  • Tracing via structlog; usage & cost estimation from Agent usage + a simple model pricing table.


Features

  • Structured extraction for small files with:

    • JSON Schema (output as dict[str, Any]), or
    • Pydantic v2 models (output as dict by default; optional model instance).
  • Pydantic AI Agent integration:

    • Raw binary attachments for PDFs/images.
    • StructuredDict for JSON Schema outputs.
    • Usage metrics retrieved from the run.
  • Batch mode runs one file per Agent call in parallel.

  • Cost estimation via a simple pricing map (optional).

  • Structlog logging to console.

  • ZIP files: extract to /tmp and process each contained file “as-is”.


What’s in / out of scope

Supported file types

  • Read as text directly: .txt, .md, .csv, .tsv, .xls, .xlsx, .json, .xml, .yaml, .yml, .html, .htm

  • Upload Directly (binary): Images (.png, .jpg, .jpeg, .webp, .gif, .bmp, .tiff), PDF (.pdf)

  • ZIP: Extracted to /tmp/nextract-zip-<name> and each file inside is processed “as-is”.

  • Office docs: .doc, .docx, .ppt, .pptx → Accepted as binary and converted to PDF before being uploaded to LLMs.

  • Audio: .mp3, .wav, .m4a, .ogg, .flac, .aac, .wma → Attached as binary bytes with their native audio/* MIME type for models with audio input support.

  • Video: .mp4, .webm, .mov, .avi, .mkv, .wmv → Attached as binary bytes with their native video/* MIME type for models with video input support.
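As an illustration of how native MIME types can be resolved (a sketch, not nextract's actual implementation), Python's stdlib mimetypes module covers most of the extensions above:

```python
import mimetypes

def guess_media_type(path: str) -> str:
    """Best-effort MIME guess for an attachment; generic binary type as fallback."""
    mime, _ = mimetypes.guess_type(path)
    return mime or "application/octet-stream"
```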


Installation

# Install from PyPI
pip install nextract

# Or install from source for development
git clone https://github.com/your-username/nextract.git
cd nextract
pip install -e .[dev]

Python: 3.10+

System Dependencies

For OCR support (scanned PDFs), you need to install Tesseract OCR binary:

  • macOS (Homebrew):

    brew install tesseract
  • Ubuntu/Debian:

    sudo apt-get update && sudo apt-get install -y tesseract-ocr
  • Fedora/CentOS/RHEL:

    sudo dnf install -y tesseract
  • Windows:

    Install a Tesseract build for Windows and ensure tesseract.exe is on your PATH.

Note: The Python packages (pytesseract, pdf2image, pillow) are automatically installed with nextract. Only the Tesseract binary needs manual installation.


Quick Start

JSON Schema output (default dict)

from nextract import extract

schema = {
    "title": "Invoice",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"}
    },
    "required": ["invoice_number", "total"]
}

res = extract(
    files=["./docs/invoice.pdf"],
    schema_or_model=schema,
    user_prompt="Extract the invoice fields exactly as defined.",
    include_extra=True,  # adds a top-level `extra` bag for helpful unmodeled fields
)

print(res["data"])   # dict[str, Any] matching your schema (+ optional `extra`)
print(res["report"]) # model, usage, cost_estimate_usd, warnings

Pydantic model output

from pydantic import BaseModel
from nextract import extract

class Invoice(BaseModel):
    invoice_number: str
    date: str | None = None
    total: float

res = extract(
    files=["./docs/invoice.pdf"],
    schema_or_model=Invoice,
    user_prompt="Extract the invoice fields."
    # include_extra is ignored for Pydantic model mode
)

# Default behavior returns a dict
print(res["data"])  # -> {'invoice_number': '...', 'date': '...', 'total': ...}

To get the Pydantic model instance instead of a dict:

res = extract(
    files=["./docs/invoice.pdf"],
    schema_or_model=Invoice,
    user_prompt="Extract the invoice fields.",
    return_pydantic=True,
)
invoice_obj = res["data"]  # -> Invoice instance

Batch extraction (parallel)

Process each file independently (one Agent call per file):

from nextract import batch_extract

schema = {
    "title": "DocSummary",
    "type": "object",
    "properties": {"title": {"type": "string"}, "summary": {"type": "string"}},
    "required": ["title"]
}

res = batch_extract(
    batch=["./a.pdf", "./b.png", "./c.txt"],   # or [["./a1.pdf","./a2.pdf"], ["./b1.pdf"]] to group
    schema_or_model=schema,
    user_prompt="Summarize each document with title + summary.",
    include_extra=False,
    max_concurrency=4,
)

# result is a dict keyed by the first file in each item
print(res.keys())  # -> {"./a.pdf", "./b.png", "./c.txt"}

CLI

nextract extract vs nextract batch

nextract extract - Processes all files together in a single AI agent run:

  • Use when you want to extract information from multiple files as a cohesive unit
  • Returns one structured data object for all files combined
  • Example: Extract information from a contract and its amendments together

nextract batch - Processes each file independently in parallel:

  • Use when you want to extract structured data from each file individually
  • Returns one result per file, keyed by filename
  • Faster for multiple files due to parallel processing (configurable concurrency)
  • Example: Extract invoice data from 100 separate invoice PDFs

# JSON Schema - single extraction run
nextract extract ./invoice.pdf ./amendment.pdf \
  --schema ./invoice.schema.json \
  --prompt "Extract the invoice fields." \
  --include-extra

# Pydantic model (module:Class or module.Class) - single extraction run
nextract extract ./invoice.pdf \
  --pydantic-model mypkg.models:Invoice

# Batch (parallel) - one extraction run per file
nextract batch ./a.pdf ./b.png ./c.txt \
  --schema ./summary.schema.json \
  --prompt "Summarize each document." \
  --max-concurrency 4

Run nextract --help, nextract extract --help, or nextract batch --help for more.


Configuration

Environment variables

Variable                          Default          Description
NEXTRACT_MODEL                    openai:gpt-4o    Pydantic AI model string (provider:model-id).
NEXTRACT_MAX_CONCURRENCY          4                Max parallel Agent calls in batch_extract.
NEXTRACT_MAX_RUN_RETRIES          5                Max retry attempts around Agent runs.
NEXTRACT_PER_CALL_TIMEOUT_SECS    120              Per-call timeout in seconds.
NEXTRACT_PRICING                  (unset)          JSON map for cost estimation (see below).
NEXTRACT_MAX_VALIDATION_ROUNDS    2                Max schema-enforced output validation retries.

Also set the provider credentials that Pydantic AI expects for your chosen provider (for example, OPENAI_API_KEY=... for OpenAI).

Pricing configuration

NEXTRACT_PRICING expects a JSON string like:

{
  "openai:gpt-4o": { "input_per_1k": 0.005, "output_per_1k": 0.015 },
  "openai:gpt-4.1-mini": { "input_per_1k": 0.003, "output_per_1k": 0.006 }
}

This is used to compute cost_estimate_usd from the Agent’s token usage. If the current model is missing from this map, cost_estimate_usd will be null.

Model selection

By default, nextract uses openai:gpt-4o (vision-capable). Choose a model by:

  • Environment variable:

    export NEXTRACT_MODEL="provider:model-id"
  • Python argument override (takes precedence over env):

    from nextract import extract, batch_extract
    
    extract(["./invoice.pdf"], schema_or_model=my_schema, model="openai:gpt-4o")
    batch_extract([["a.pdf"],["b.png"]], schema_or_model=my_schema, model="anthropic:claude-3-7-sonnet")
  • CLI flag (takes precedence over env):

    nextract extract ./invoice.pdf --schema schema.json --model openai:gpt-4o
    nextract batch ./a.pdf ./b.png --schema schema.json --model anthropic:claude-3-7-sonnet

You can still construct and pass a RuntimeConfig if you need to tune concurrency, retries, or timeouts.


How it works

  1. You pass file paths (single or multiple).

  2. nextract prepares content:

    • Textual files are read and injected as-is into the prompt, wrapped by:

      --- BEGIN FILE: <name> (mime) ---
      <file contents>
      --- END FILE: <name> ---
      
    • Binary files (PDFs, images, Office docs, others) are attached as binary parts using Pydantic AI’s BinaryContent.

  3. An Agent is created with:

    • a system prompt that instructs strict, schema-aligned extraction,

    • an output_type of either:

      • StructuredDict(JSON Schema) → outputs a dict, or
      • Your Pydantic Model → outputs a model instance (dumped to dict by default).
  4. For JSON Schema mode, a jsonschema validator runs as an output validator. On failure, the Agent is asked to retry briefly (limited rounds).

  5. The result is validated again before returning. You get:

    • data (dict by default),
    • report with usage and optional cost estimate.
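The text-file wrapping in step 2 amounts to something like the following (a sketch; nextract's real formatting lives in nextract/files.py):

```python
def wrap_text_file(name: str, mime: str, contents: str) -> str:
    """Delimit a text file's contents so the model can tell files apart in the prompt."""
    return (
        f"--- BEGIN FILE: {name} ({mime}) ---\n"
        f"{contents}\n"
        f"--- END FILE: {name} ---"
    )
```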

File type handling

  • Text: .txt, .md, .json, .yaml, .yml, .xml, .csv, .tsv, .html, .htm → Read as text (UTF‑8 with fallback), injected verbatim with file delimiters.
  • Excel: .xlsx (parsed to TSV via in‑process XML), .xls (CSV via CLI if available; else raw fallback) → Read as text and injected like other textual files. Best‑effort extraction (no styling/formatting).
  • PDF / Images: .pdf, .png, .jpg, .jpeg, .webp, .gif, .bmp, .tiff → Attached as binary bytes for the model (vision-capable models recommended).
  • Audio: .mp3, .wav, .m4a, .ogg, .flac, .aac, .wma → Attached as binary bytes with native audio/* MIME type.
  • Video: .mp4, .webm, .mov, .avi, .mkv, .wmv → Attached as binary bytes with native video/* MIME type.
  • Office docs: .doc, .docx, .ppt, .pptx → Converted to PDF via LibreOffice/soffice or unoconv if available; on failure, attached as original binary.
  • ZIP: Extracted to /tmp/nextract-zip-<name>; each inner file is processed as above. No nested recursion.

OCR Support: Scanned PDFs are automatically detected and processed using Tesseract OCR. Requires Tesseract binary to be installed (see Installation).
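The ZIP behavior can be pictured as a simple extract-then-enumerate step (a simplified sketch of the /tmp/nextract-zip-<name> handling; the real code may differ in naming and cleanup):

```python
import tempfile
import zipfile
from pathlib import Path

def extract_zip(zip_path: str) -> list[Path]:
    """Extract a ZIP into the system temp dir and return the inner file paths."""
    dest = Path(tempfile.gettempdir()) / f"nextract-zip-{Path(zip_path).stem}"
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    # Each plain file is then processed as usual; nested ZIPs are not recursed into.
    return [p for p in dest.rglob("*") if p.is_file()]
```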


Office → PDF conversion

nextract attempts to convert .doc/.docx/.ppt/.pptx to PDF using system tools. These are external dependencies and are not installed via pip.

  • Preferred: soffice (LibreOffice) in headless mode.
  • Fallback: unoconv (uses LibreOffice UNO).
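The preferred path is a headless CLI call; a sketch of the command nextract would assemble (flags follow LibreOffice's documented interface; the helper name is hypothetical):

```python
from pathlib import Path

def soffice_pdf_command(input_path: str, outdir: str) -> list[str]:
    """Build the LibreOffice headless command that converts an Office doc to PDF."""
    return [
        "soffice",
        "--headless",
        "--convert-to", "pdf",
        "--outdir", outdir,
        str(Path(input_path)),
    ]
```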

Installation hints:

  • macOS (Homebrew):
    • brew install --cask libreoffice
    • Ensure soffice is on your PATH. If not, you can symlink:
      • ln -s "/Applications/LibreOffice.app/Contents/MacOS/soffice" /usr/local/bin/soffice (adjust for Apple Silicon/Homebrew prefix)
  • Ubuntu/Debian:
    • sudo apt-get update && sudo apt-get install -y libreoffice
    • Optional: sudo apt-get install -y unoconv
  • Fedora/CentOS/RHEL:
    • sudo dnf install -y libreoffice (or yum on older systems)
    • Optional: install unoconv from your distro repos if available.
  • Windows:
    • Install LibreOffice and ensure soffice.exe is on your PATH.

If neither tool is found, nextract logs a warning and falls back to attaching the original Office binary.

Examples & Few-shot Hints

You can supply examples to guide the model:

Programmatic (examples argument):

  • Output-only examples: list[dict]
  • Paired input/output: list[tuple[str | None, dict]]

CLI (--examples JSON file):

  • Output-only examples:

    [
      { "invoice_number": "INV-001", "total": 123.45 }
    ]
  • Paired input/output (use a two-element array):

    [
      ["Item: Widget A, Total: 123.45", { "invoice_number": "INV-001", "total": 123.45 }]
    ]

“Extra” fields (JSON Schema mode): If you pass include_extra=True, your schema is augmented with a top-level:

"extra": { "type": "object", "additionalProperties": true }

so the model can place relevant-but-unspecified fields there.
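The augmentation itself is a small schema transform, roughly (a sketch; the helper name is hypothetical):

```python
import copy

def add_extra_bag(schema: dict) -> dict:
    """Return a copy of a JSON Schema with a top-level free-form `extra` object."""
    augmented = copy.deepcopy(schema)
    augmented.setdefault("properties", {})["extra"] = {
        "type": "object",
        "additionalProperties": True,
    }
    return augmented
```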


Return shape

All entry points return a dict with this structure:

{
  "data": { /* your structured result (dict by default) */ },
  "report": {
    "model": "provider:model-id",
    "files": ["..."],
    "usage": {
      "requests": 1,
      "tool_calls": 0,
      "input_tokens": 123,
      "output_tokens": 456,
      "details": { /* provider-dependent */ }
    },
    "cost_estimate_usd": 0.0123,
    "warnings": []
  }
}
  • In Pydantic model mode, data is still a dict unless you passed return_pydantic=True, in which case it’s the model instance.

Logging & Tracing

  • Uses structlog; logs are JSON-formatted to stdout.
  • Each extraction logs: model, files, usage, warnings, and cost estimate.
  • You can set up your own logging before calling extract/batch_extract. By default, the library initializes logging for you (toggle via setup_logs=False).

Retries, Rate Limits & Timeouts

  • Each Agent call is wrapped with exponential backoff (max attempts from NEXTRACT_MAX_RUN_RETRIES).
  • Timeout per call is NEXTRACT_PER_CALL_TIMEOUT_SECS (default 120s).
  • In batch mode, up to NEXTRACT_MAX_CONCURRENCY tasks run in parallel (default 4).
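The retry loop can be sketched as exponential backoff around each Agent call (simplified; the real runner's delay schedule and jitter may differ):

```python
import time

def run_with_backoff(call, max_attempts: int = 5, base_delay: float = 1.0,
                     sleep=time.sleep):
    """Retry `call` with exponentially growing delays; re-raise after the last attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```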

Large files (TODO)

Planned design (not implemented in this build):

  • Chunking (semantic/page) for large inputs.
  • Per-chunk extraction with all fields optional, then merge into a full model validated against the target schema/model.
  • Pluggable conflict resolution & optional provenance.

Limitations

  • No readability parsing for HTML.
  • OCR requires Tesseract binary to be installed separately (Python packages are included).
  • Office conversions require soffice (LibreOffice) or unoconv installed; otherwise we fall back to attaching the original binary.
  • Office file understanding depends on the model/provider.
  • Very large inputs may exceed model or provider limits.
  • ZIP extraction writes to /tmp/nextract-zip-<name>; these temp files are not auto-deleted by the library.

FAQ

Q: Which providers/models can I use?
A: Any supported by the Pydantic AI Agent. Select one via NEXTRACT_MODEL="provider:model-id" and set the provider’s expected credentials (e.g., OPENAI_API_KEY).

Q: What happens if schema validation keeps failing?
A: The Agent is asked to retry a limited number of times. Final results are validated once more; if still invalid, you’ll see a final_validation_error entry under report.warnings.

Q: Can I store or inspect attachments that nextract sends?
A: This build sends raw text or binary bytes directly to the Agent. If you need durable storage or redaction, wrap nextract in your own pipeline.

Q: Can I get a Pydantic model instance out?
A: Yes. Pass your model class as schema_or_model and set return_pydantic=True.


Development

Building & Testing

# Install development dependencies
pip install -e .[dev]

# Run tests
pytest

# Run linting
ruff check nextract

# Build package
python -m build

# Test installation
pip install dist/*.whl

CI/CD

This project uses GitHub Actions for continuous integration and automated PyPI publishing:

  • CI: Runs on every push/PR, testing across Python 3.10-3.12
  • Release: Automatically publishes to PyPI when GitHub releases are created
  • Versioning: Managed statically in pyproject.toml (current: 0.0.1)

Creating a Release

  1. Bump version in pyproject.toml, commit, and push.
  2. Create a GitHub release (with notes) — this triggers automatic PyPI publishing.

License

MIT. Feel free to adapt and extend.


Project Structure (for reference)

nextract/
  ├─ nextract/
  │  ├─ __init__.py            # exports extract, batch_extract
  │  ├─ version.py
  │  ├─ config.py              # RuntimeConfig (model, concurrency, timeouts, pricing)
  │  ├─ logging.py             # structlog setup
  │  ├─ mimetypes_map.py       # simple mapping & helpers
  │  ├─ schema.py              # JSON Schema/Pydantic utilities
  │  ├─ prompts.py             # system prompt + examples builder
  │  ├─ files.py               # read-as-is; BinaryContent or text
  │  ├─ pricing.py             # usage → cost estimate
  │  ├─ agent_runner.py        # Agent wiring, retries, validation, metrics
  │  ├─ core.py                # public API: extract, batch_extract
  │  └─ cli.py                 # Typer CLI
  └─ pyproject.toml

About

A lightweight Python framework for extracting structured data from files using Pydantic AI Agent. Supports PDFs, images, documents, and more with JSON Schema/Pydantic outputs, batch processing, and cost estimation.
