An IR-first, extensible document compiler for AI systems.
This is NOT a PDF-to-Markdown script. It is a production-grade document ingestion and canonicalization engine designed with a compiler-like architecture: Input → IR → Backends.
Think like a compiler engineer:
- Input Layer: Format-specific parsers (currently PDF via Docling)
- AST/IR: Canonical intermediate representation with strict schema
- Backends: Multiple export formats (Markdown, Text, Parquet)
```
┌─────────────────────────────────────────┐
│           Input Adapter Layer           │
│       Format-specific parsing only      │
└────────────────────┬────────────────────┘
                     │
┌────────────────────▼────────────────────┐
│            Extraction Layer             │
│     Extract raw structural elements     │
└────────────────────┬────────────────────┘
                     │
┌────────────────────▼────────────────────┐
│           Normalization Layer           │
│   Convert to canonical IR with hashing  │
└────────────────────┬────────────────────┘
                     │
┌────────────────────▼────────────────────┐
│           Canonical IR Layer            │
│ Typed schema, stable IDs, relationships │
└────────────────────┬────────────────────┘
                     │
┌────────────────────▼────────────────────┐
│              Export Layer               │
│     Markdown, Text, Parquet, Assets     │
└─────────────────────────────────────────┘
```
- Hash-based stable IDs (document, block, table, image, chunk)
- Running pipeline twice produces identical output
- No UUIDs, no randomness
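The determinism guarantee can be illustrated with a content-addressed ID scheme along these lines (a sketch only; `stable_block_id` is an illustrative name, and the exact fields LayoutIR hashes live in `utils/hashing.py`):

```python
import hashlib

def stable_block_id(document_id: str, page_number: int,
                    block_index: int, content: str) -> str:
    """Derive a block ID purely from document content and position,
    so re-running the pipeline yields the same ID every time."""
    payload = f"{document_id}:{page_number}:{block_index}:{content}".encode("utf-8")
    return "blk_" + hashlib.sha256(payload).hexdigest()[:16]

# Same inputs always produce the same ID -- no UUIDs, no randomness.
assert stable_block_id("doc1", 1, 0, "Hello") == stable_block_id("doc1", 1, 0, "Hello")
```

Because the ID is a pure function of the inputs, two pipeline runs over the same file cannot diverge, which is what makes diffing and caching outputs practical.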
```
Document
├── document_id: str (hash-based)
├── schema_version: str
├── parser_version: str
├── metadata: DocumentMetadata
├── blocks: List[Block]
│   ├── block_id: str (deterministic)
│   ├── type: BlockType (heading, paragraph, table, image, etc.)
│   ├── content: str
│   ├── page_number: int
│   ├── bbox: BoundingBox
│   └── metadata: dict
└── relationships: List[Relationship]
```

- SemanticSectionChunker: section-based chunking (splits on headings)
- TokenWindowChunker: fixed token windows with overlap
- LayoutAwareChunker: layout-aware chunking (stub)
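The IR tree above can be sketched as typed models (shown here with stdlib dataclasses for brevity; the actual `schema.py` uses Pydantic for validation, and the field set below follows the tree rather than the real module):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class BlockType(str, Enum):
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    TABLE = "table"
    IMAGE = "image"

@dataclass
class BoundingBox:
    x0: float
    y0: float
    x1: float
    y1: float

@dataclass
class Block:
    block_id: str           # deterministic, hash-based
    type: BlockType
    content: str
    page_number: int
    bbox: Optional[BoundingBox] = None
    metadata: dict = field(default_factory=dict)

@dataclass
class Document:
    document_id: str        # hash-based
    schema_version: str
    parser_version: str
    blocks: List[Block] = field(default_factory=list)
```

Keeping the IR as a strict typed schema is what lets every backend consume the same structure without re-parsing the source.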
All chunking operates on IR, not raw text.
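A token-window strategy over IR block text might look like the sketch below (a naive whitespace tokenizer is assumed for illustration; the real `TokenWindowChunker` operates on the project's `Chunk` type and presumably uses a proper tokenizer):

```python
from typing import List

def token_window_chunks(block_texts: List[str],
                        chunk_size: int = 1024,
                        overlap: int = 128) -> List[str]:
    """Concatenate IR block text, then emit fixed-size token windows
    that overlap so no window loses context at its boundary.
    Assumes chunk_size > overlap."""
    tokens = " ".join(block_texts).split()  # naive whitespace tokenization
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The overlap means the last `overlap` tokens of each window reappear at the start of the next one, matching the `--chunk-size 1024 --chunk-overlap 128` CLI options shown below.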
- Markdown: Human-readable with formatting
- Plain Text: Simple text extraction
- Parquet: Efficient structured storage for tables/blocks
- Assets: Extracted images (PNG) and tables (CSV)
```
/<document_id>/
    manifest.json      # Processing metadata
    ir.json            # Canonical IR
    chunks.json        # Chunk definitions
    /assets/
        /images/       # Extracted images
        /tables/       # Tables as CSV
    /exports/
        /markdown/     # Markdown output
        /text/         # Plain text output
        /parquet/      # Parquet datasets
    /logs/             # Processing logs
```
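Downstream consumers can read outputs straight off this layout; a minimal sketch, assuming the directory structure and file names above (the `load_chunks` helper is illustrative, not part of the LayoutIR API):

```python
import json
from pathlib import Path

def load_chunks(output_root: Path, document_id: str) -> list:
    """Read the chunk definitions for one processed document,
    following the on-disk layout: <output_root>/<document_id>/chunks.json."""
    doc_dir = output_root / document_id
    with open(doc_dir / "chunks.json", "r", encoding="utf-8") as f:
        return json.load(f)
```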
IMPORTANT: LayoutIR requires PyTorch with CUDA 13.0 support for GPU acceleration. Install PyTorch first:

```shell
# Step 1: Install PyTorch with CUDA 13.0 (REQUIRED)
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu130

# Step 2: Install LayoutIR
pip install layoutir
```

```shell
# Install from source
git clone https://github.com/RahulPatnaik/layoutir.git
cd layoutir
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu130
pip install -e .
```

Note: The package intentionally does not include PyTorch in its base dependencies, to ensure you get the correct CUDA version. Any existing PyTorch installation will be overwritten by the CUDA 13.0 version.
```shell
# Using the CLI
layoutir --input file.pdf --output ./out

# Or using Python directly
python -m layoutir.cli --input file.pdf --output ./out
```

```shell
# Semantic chunking (default)
layoutir --input file.pdf --output ./out --chunk-strategy semantic

# Token-based chunking with custom size
layoutir --input file.pdf --output ./out \
    --chunk-strategy token \
    --chunk-size 1024 \
    --chunk-overlap 128

# Enable GPU acceleration
layoutir --input file.pdf --output ./out --use-gpu

# Debug mode with structured logging
layoutir --input file.pdf --output ./out \
    --log-level DEBUG \
    --structured-logs
```

```python
from pathlib import Path

from layoutir import Pipeline
from layoutir.adapters import DoclingAdapter
from layoutir.chunking import SemanticSectionChunker

# Create pipeline
adapter = DoclingAdapter(use_gpu=True)
chunker = SemanticSectionChunker(max_heading_level=2)
pipeline = Pipeline(adapter=adapter, chunk_strategy=chunker)

# Process document
document = pipeline.process(
    input_path=Path("document.pdf"),
    output_dir=Path("./output"),
)

# Access results
print(f"Extracted {len(document.blocks)} blocks")
print(f"Document ID: {document.document_id}")
```

```
src/layoutir/
├── schema.py                # Canonical IR schema (Pydantic)
├── pipeline.py              # Main orchestrator
│
├── adapters/                # Input adapters
│   ├── base.py              # Abstract interface
│   └── docling_adapter.py   # PDF via Docling
│
├── extraction/              # Raw element extraction
│   └── docling_extractor.py
│
├── normalization/           # IR normalization
│   └── normalizer.py
│
├── chunking/                # Chunking strategies
│   └── strategies.py
│
├── exporters/               # Export backends
│   ├── markdown_exporter.py
│   ├── text_exporter.py
│   ├── parquet_exporter.py
│   └── asset_writer.py
│
└── utils/
    ├── hashing.py           # Deterministic ID generation
    └── logging_config.py    # Structured logging
```

```
ingest.py          # CLI entrypoint
benchmark.py       # Performance benchmark
test_pipeline.py   # Integration test
```
- Strict layer separation
- Deterministic processing
- Schema validation
- Pluggable strategies
- Observability/timing
- Efficient storage (Parquet)
- Mix business logic into adapters
- Hardcode paths or configurations
- Use non-deterministic IDs (UUIDs)
- Combine IR and export logic
- Skip schema validation
- Load entire files into memory unnecessarily
- Implement the `InputAdapter` interface:

```python
class DocxAdapter(InputAdapter):
    def parse(self, file_path: Path) -> Any: ...
    def supports_format(self, file_path: Path) -> bool: ...
    def get_parser_version(self) -> str: ...
```

- Implement the corresponding extractor
- Update the pipeline to use the new adapter
```python
class CustomChunker(ChunkStrategy):
    def chunk(self, document: Document) -> List[Chunk]:
        # Operate on IR blocks
        ...
```

```python
class JsonExporter(Exporter):
    def export(self, document: Document, output_dir: Path, chunks: List[Chunk]):
        # Export from canonical IR
        ...
```

Designed to handle 200+ page PDFs efficiently:
- Streaming processing where possible
- Lazy loading of heavy dependencies
- GPU acceleration support
- Parallel export operations
- Efficient Parquet storage for tables
- Structured JSON logging
- Stage-level timing metrics
- Extraction statistics
- Deterministic output for debugging
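Structured JSON logging with stage-level timing, as listed above, can be sketched with the stdlib (the project's `logging_config.py` may differ; `JsonFormatter` and the `stage`/`elapsed_ms` fields are illustrative):

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, with optional stage timing attached."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "stage": getattr(record, "stage", None),
            "elapsed_ms": getattr(record, "elapsed_ms", None),
            "message": record.getMessage(),
        })

logger = logging.getLogger("layoutir")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

start = time.perf_counter()
# ... run a pipeline stage ...
logger.info("normalization done",
            extra={"stage": "normalize",
                   "elapsed_ms": (time.perf_counter() - start) * 1000})
```

Machine-parseable log lines make the stage timings easy to aggregate across runs, which pairs well with deterministic output when debugging regressions.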
Current schema version: 1.0.0
Future schema changes will be tracked via semantic versioning:
- Major: Breaking changes to IR structure
- Minor: Backwards-compatible additions
- Patch: Bug fixes
- DOCX input adapter
- HTML input adapter
- Advanced layout-aware chunking
- Parallel page processing
- Incremental updates (only reprocess changed pages)
- Vector embeddings export
- OCR fallback for scanned PDFs
See project root for license information.
This is a research/prototype phase project. See main project README for contribution guidelines.