
# Financial Document Parser

A production-grade desktop application for automated extraction, validation, and export of structured data from PDF financial documents. Features a coordinate-based extraction engine, intelligent vendor auto-detection, multi-level confidence scoring, batch processing, and a full desktop GUI.

Originally built for a German financial services client to process vendor invoices at scale. Designed with an extensible architecture to support additional document formats and locales.

Features

Coordinate-Based Extraction Engine — Spatially-aware field extraction using configurable coordinate zones per document template, not fragile regex patterns. Supports text, numeric, date, and currency field types with fallback zones and validation patterns.
Automatic Vendor Detection — Multi-layered analysis (keyword matching, pattern recognition, document structure) to identify document source and select the correct extraction template automatically.
Multi-Factor Confidence Scoring — Evaluates extraction quality based on data completeness, format validation, vendor-specific patterns, and consistency checks. Outputs a confidence level (low / medium / high) with field-level scores and actionable recommendations.
Batch Processing — Multi-threaded processing engine with job queue management, memory monitoring, progress tracking, and comprehensive error recovery. Supports priority queuing and template overrides per job.
Desktop GUI — Full tkinter interface with dark/light themes, PDF viewer, results display, batch queue viewer, template editor, coordinate zone editor, data quality dashboard, and export dialogs.
WCAG Accessibility — Built-in accessibility enhancements and compliance for the desktop interface.
Configurable Template System — Vendor-specific extraction templates with a visual template editor and template library manager. Easy to extend for new document formats.
Data Export — CSV export with configurable formatting, export preview, and localized output.
Production Infrastructure — Structured logging (structlog), error recovery, performance profiling, security scanning (bandit/safety), type checking (mypy), and a comprehensive test suite (pytest).
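
To make the idea of coordinate zones with fallbacks and validation patterns concrete, here is a minimal sketch. It is not the project's `coordinate_engine.py`: the `Zone` class, the `extract_field` helper, the zone coordinates, and the sample invoice values are all hypothetical. The word dictionaries only mirror the shape returned by pdfplumber's `Page.extract_words()` (`text`, `x0`, `x1`, `top`, `bottom`).

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Zone:
    """A rectangular page region in PDF points (hypothetical structure)."""
    x0: float
    top: float
    x1: float
    bottom: float
    pattern: Optional[str] = None  # optional validation regex for the zone's text


def words_in_zone(words, zone):
    """Return the words whose bounding-box centre lies inside the zone."""
    hits = []
    for w in words:
        cx = (w["x0"] + w["x1"]) / 2
        cy = (w["top"] + w["bottom"]) / 2
        if zone.x0 <= cx <= zone.x1 and zone.top <= cy <= zone.bottom:
            hits.append(w)
    # Reading order: top to bottom, then left to right
    hits.sort(key=lambda w: (round(w["top"]), w["x0"]))
    return hits


def extract_field(words, zones):
    """Try the primary zone, then each fallback; return the first valid text or None."""
    for zone in zones:
        text = " ".join(w["text"] for w in words_in_zone(words, zone))
        if not text:
            continue  # nothing in this zone, try the next one
        if zone.pattern is None or re.fullmatch(zone.pattern, text):
            return text
    return None


# Made-up word boxes standing in for one page of an invoice
sample_words = [
    {"text": "Rechnung", "x0": 50, "x1": 110, "top": 40, "bottom": 52},
    {"text": "RE-2024-0017", "x0": 400, "x1": 480, "top": 40, "bottom": 52},
    {"text": "1.234,56", "x0": 420, "x1": 470, "top": 700, "bottom": 712},
]

invoice_number = extract_field(
    sample_words,
    [Zone(380, 30, 500, 60, pattern=r"RE-\d{4}-\d{4}")],
)
total = extract_field(
    sample_words,
    [
        Zone(380, 600, 500, 650, pattern=r"[\d.]+,\d{2}"),  # primary zone, empty here
        Zone(380, 690, 500, 720, pattern=r"[\d.]+,\d{2}"),  # fallback zone
    ],
)
```

In this sketch the regex only validates what a zone produced; locating the field is done purely by position, which is the distinction the feature description draws against regex-based extraction.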

Architecture

src/
├── main.py                  # CLI entry point (single file, batch, or GUI mode)
├── production_app.py        # Production application wrapper
├── core/                    # Application lifecycle and orchestration
│   ├── app.py               # Main app controller and processing pipeline
│   ├── config.py            # Configuration management
│   ├── settings_manager.py  # Runtime settings
│   ├── batch_processor_advanced.py  # Multi-threaded batch engine
│   ├── batch_queue_manager.py       # Job queue and priority management
│   ├── batch_progress_tracker.py    # Progress monitoring
│   ├── batch_memory_manager.py      # Memory usage optimization
│   ├── batch_error_handler.py       # Error recovery strategies
│   ├── error_recovery.py    # Global error recovery
│   ├── error_reporting.py   # Structured error reports
│   └── exceptions.py        # Custom exception hierarchy
├── processing/              # Core extraction pipeline
│   ├── coordinate_engine.py # Coordinate-based field extraction
│   ├── vendor_detector.py   # Automatic vendor identification
│   ├── confidence_scorer.py # Multi-factor confidence analysis
│   ├── pdf_processor.py     # PDF reading and page handling
│   ├── german_field_recognizer.py  # Locale-specific field recognition
│   └── performance_optimizer.py    # Processing performance tuning
├── templates/               # Vendor-specific extraction templates
│   ├── template_manager.py  # Template loading and selection
│   ├── base_template.py     # Abstract template interface
│   ├── jobrad_template.py
│   ├── business_bike_template.py
│   ├── deutsche_dienstrad_template.py
│   └── bls_bikeleasing_template.py
├── gui/                     # Desktop interface (tkinter)
│   ├── main_window.py       # Primary application window
│   ├── pdf_viewer.py        # PDF document viewer
│   ├── results_viewer.py    # Extraction results display
│   ├── batch_manager.py     # Batch processing UI
│   ├── template_editor.py   # Visual template configuration
│   ├── coordinate_zone_editor.py   # Spatial zone editor
│   ├── data_quality_dashboard.py   # Quality metrics dashboard
│   ├── export_manager.py    # Export configuration and preview
│   ├── theme_manager.py     # Dark/light theme system
│   └── wcag_accessibility_system.py # Accessibility compliance
├── data_management/         # Data persistence and export
├── localization/            # i18n and locale-specific formatting
└── utils/                   # Logging, helpers, and shared utilities

Processing Pipeline

PDF Input → Vendor Detection → Template Selection → Coordinate Extraction
    → Field Validation → Confidence Scoring → Manual Review (if low confidence)
    → Data Export (CSV)

Tech Stack

Component	Technology
Language	Python 3.11+
PDF Processing	pdfplumber, PyMuPDF (fitz)
Data Handling	pandas
GUI Framework	tkinter, Pillow, matplotlib
Localization	Babel
Logging	structlog
Testing	pytest, pytest-cov, pytest-mock, pytest-qt
Code Quality	mypy, black, isort, flake8, pylint
Security	bandit, safety
Documentation	Sphinx (Read the Docs theme)

Getting Started

Prerequisites

Python 3.11 or higher
4 GB RAM minimum
500 MB free disk space

Installation

# Clone the repository
git clone https://github.com/stalinrod/financial-document-parser.git
cd financial-document-parser

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate    # Linux/macOS
venv\Scripts\activate       # Windows

# Install dependencies
pip install -r requirements.txt

Usage

Process a single PDF

python src/main.py invoice.pdf

Batch process a directory

python src/main.py --batch ./invoices/

Export results to CSV

python src/main.py invoice.pdf --export output.csv

Launch the desktop GUI

python src/main.py --gui

Sample Documents

This repository does not include sample PDF files. To test the parser, place your own PDF documents in a local directory and pass the path as an argument:

python src/main.py /path/to/your/documents/invoice.pdf

Programmatic usage

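from src.core.app import PDFParserApp

# Set up the application before processing any documents
app = PDFParserApp()
app.initialize()

# Process a single PDF
result = app.process_pdf("path/to/invoice.pdf")

# Read the extracted fields only when processing succeeded
if result.success:
    print(f"Vendor: {result.vendor_type}")
    print(f"Confidence: {result.confidence_score:.2f}")
    print(f"Data: {result.extracted_data}")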
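
When several files need to be processed from a script, a call like the one above could be wrapped in a thread pool. The sketch below uses only the standard library; `process_many` and `fake_process` are hypothetical names, and it is an assumption (not confirmed by this README) that `process_pdf` is safe to call from multiple threads. For real workloads the built-in `--batch` mode is likely the better choice.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_many(paths, process_one, max_workers=4):
    """Run process_one over all paths in a thread pool.

    Returns (results, errors), both keyed by path, so that one unreadable
    file does not abort the rest of the batch.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, str(p)): str(p) for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep the error and carry on with other files
                errors[path] = exc
    return results, errors


def fake_process(path):
    """Stand-in for a real per-file call such as app.process_pdf(path)."""
    if path.endswith("broken.pdf"):
        raise ValueError("unreadable PDF")
    return {"path": path, "success": True}


results, errors = process_many(["a.pdf", "broken.pdf", "b.pdf"], fake_process)
```

With the real application, the second argument would be something like `app.process_pdf`, and `paths` could come from `pathlib.Path("invoices").glob("*.pdf")`.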

Configuration

Application settings are managed via settings.json:

{
  "language": "de",
  "theme": "dark",
  "automation_threshold": 0.8,
  "export_directory": "exports/",
  "log_level": "INFO"
}

Key parameters:

automation_threshold — Confidence score above which results are auto-accepted without manual review (default: 0.8)
theme — GUI theme (dark or light)
log_level — Logging verbosity (DEBUG, INFO, WARNING, ERROR)
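
As an illustration of how these settings might be read and checked, here is a small loader. It is a sketch, not the project's `settings_manager.py`: `load_settings` and `validate_settings` are hypothetical names, the defaults other than `automation_threshold` are simply copied from the example file above, and the 0 to 1 range for the threshold is an assumption.

```python
import json
from pathlib import Path

# Only automation_threshold's default (0.8) is documented; the rest mirror the example file
DEFAULTS = {
    "language": "de",
    "theme": "dark",
    "automation_threshold": 0.8,
    "export_directory": "exports/",
    "log_level": "INFO",
}
VALID_THEMES = {"dark", "light"}
VALID_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


def validate_settings(settings):
    """Raise ValueError if a documented key holds an unsupported value."""
    if settings["theme"] not in VALID_THEMES:
        raise ValueError(f"theme must be one of {sorted(VALID_THEMES)}")
    if settings["log_level"] not in VALID_LOG_LEVELS:
        raise ValueError(f"log_level must be one of {sorted(VALID_LOG_LEVELS)}")
    if not 0.0 <= settings["automation_threshold"] <= 1.0:  # assumed valid range
        raise ValueError("automation_threshold must be between 0 and 1")
    return settings


def load_settings(path="settings.json"):
    """Merge the JSON file (if present) over the defaults, then validate."""
    settings = dict(DEFAULTS)
    file = Path(path)
    if file.exists():
        settings.update(json.loads(file.read_text(encoding="utf-8")))
    return validate_settings(settings)
```

Validating at load time means a typo such as `"theme": "drak"` fails immediately with a clear message instead of surfacing later in the GUI.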

Testing

# Run full test suite
pytest

# Run with coverage report
pytest --cov=src --cov-report=html

# Run specific test module
pytest tests/integration/test_core_processing_engine.py -v

License

MIT License — see LICENSE for details.

