Financial Document Parser

A production-grade desktop application for automated extraction, validation, and export of structured data from PDF financial documents. Features a coordinate-based extraction engine, intelligent vendor auto-detection, multi-level confidence scoring, batch processing, and a full desktop GUI.

Originally built for a German financial services client to process vendor invoices at scale. Designed with an extensible architecture to support additional document formats and locales.


Features

  • Coordinate-Based Extraction Engine — Spatially-aware field extraction using configurable coordinate zones per document template, not fragile regex patterns. Supports text, numeric, date, and currency field types with fallback zones and validation patterns.
  • Automatic Vendor Detection — Multi-layered analysis (keyword matching, pattern recognition, document structure) to identify document source and select the correct extraction template automatically.
  • Multi-Factor Confidence Scoring — Evaluates extraction quality based on data completeness, format validation, vendor-specific patterns, and consistency checks. Outputs a confidence level (low / medium / high) with field-level scores and actionable recommendations.
  • Batch Processing — Multi-threaded processing engine with job queue management, memory monitoring, progress tracking, and comprehensive error recovery. Supports priority queuing and template overrides per job.
  • Desktop GUI — Full tkinter interface with dark/light themes, PDF viewer, results display, batch queue viewer, template editor, coordinate zone editor, data quality dashboard, and export dialogs.
  • WCAG Accessibility — Built-in accessibility enhancements and compliance for the desktop interface.
  • Configurable Template System — Vendor-specific extraction templates with a visual template editor and template library manager. Easy to extend for new document formats.
  • Data Export — CSV export with configurable formatting, export preview, and localized output.
  • Production Infrastructure — Structured logging (structlog), error recovery, performance profiling, security scanning (bandit/safety), type checking (mypy), and a comprehensive test suite (pytest).
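
To make the coordinate-based idea concrete, here is a minimal sketch of zone extraction with a fallback zone. It is not the project's engine. It assumes words have already been extracted with bounding boxes, shaped like the dictionaries returned by pdfplumber's `extract_words()` (`text`, `x0`, `x1`, `top`, `bottom`). The `Zone` class, the `extract_zone_text` function, and all coordinates and sample words are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Zone:
    """Rectangular page region in PDF points (hypothetical structure)."""
    x0: float
    top: float
    x1: float
    bottom: float

    def contains(self, word: dict) -> bool:
        # A word belongs to the zone if its bounding box lies fully inside it.
        return (word["x0"] >= self.x0 and word["x1"] <= self.x1
                and word["top"] >= self.top and word["bottom"] <= self.bottom)


def extract_zone_text(words: list[dict], zone: Zone,
                      fallback: Optional[Zone] = None) -> str:
    """Join the words inside `zone` in reading order; use `fallback` if empty."""
    hits = [w for w in words if zone.contains(w)]
    if not hits and fallback is not None:
        hits = [w for w in words if fallback.contains(w)]
    # Sort top-to-bottom, then left-to-right.
    hits.sort(key=lambda w: (round(w["top"]), w["x0"]))
    return " ".join(w["text"] for w in hits)


# Made-up sample data in the shape of pdfplumber's extract_words() output.
words = [
    {"text": "Rechnung", "x0": 50, "x1": 110, "top": 40, "bottom": 52},
    {"text": "Nr.", "x0": 400, "x1": 415, "top": 100, "bottom": 112},
    {"text": "2024-0042", "x0": 420, "x1": 480, "top": 100, "bottom": 112},
]
invoice_zone = Zone(x0=390, top=95, x1=500, bottom=115)
invoice_number = extract_zone_text(words, invoice_zone)
print(invoice_number)
```

Because a zone selects text by position rather than by surrounding wording, a label change on the invoice does not break extraction as long as the layout stays the same.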

Architecture

src/
├── main.py                  # CLI entry point (single file, batch, or GUI mode)
├── production_app.py        # Production application wrapper
├── core/                    # Application lifecycle and orchestration
│   ├── app.py               # Main app controller and processing pipeline
│   ├── config.py            # Configuration management
│   ├── settings_manager.py  # Runtime settings
│   ├── batch_processor_advanced.py  # Multi-threaded batch engine
│   ├── batch_queue_manager.py       # Job queue and priority management
│   ├── batch_progress_tracker.py    # Progress monitoring
│   ├── batch_memory_manager.py      # Memory usage optimization
│   ├── batch_error_handler.py       # Error recovery strategies
│   ├── error_recovery.py    # Global error recovery
│   ├── error_reporting.py   # Structured error reports
│   └── exceptions.py        # Custom exception hierarchy
├── processing/              # Core extraction pipeline
│   ├── coordinate_engine.py # Coordinate-based field extraction
│   ├── vendor_detector.py   # Automatic vendor identification
│   ├── confidence_scorer.py # Multi-factor confidence analysis
│   ├── pdf_processor.py     # PDF reading and page handling
│   ├── german_field_recognizer.py  # Locale-specific field recognition
│   └── performance_optimizer.py    # Processing performance tuning
├── templates/               # Vendor-specific extraction templates
│   ├── template_manager.py  # Template loading and selection
│   ├── base_template.py     # Abstract template interface
│   ├── jobrad_template.py
│   ├── business_bike_template.py
│   ├── deutsche_dienstrad_template.py
│   └── bls_bikeleasing_template.py
├── gui/                     # Desktop interface (tkinter)
│   ├── main_window.py       # Primary application window
│   ├── pdf_viewer.py        # PDF document viewer
│   ├── results_viewer.py    # Extraction results display
│   ├── batch_manager.py     # Batch processing UI
│   ├── template_editor.py   # Visual template configuration
│   ├── coordinate_zone_editor.py   # Spatial zone editor
│   ├── data_quality_dashboard.py   # Quality metrics dashboard
│   ├── export_manager.py    # Export configuration and preview
│   ├── theme_manager.py     # Dark/light theme system
│   └── wcag_accessibility_system.py # Accessibility compliance
├── data_management/         # Data persistence and export
├── localization/            # i18n and locale-specific formatting
└── utils/                   # Logging, helpers, and shared utilities
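
As an illustration of how `vendor_detector.py` and the template modules could fit together, the following sketch shows only the keyword-matching layer of vendor detection. The vendor keys mirror the template file names in the tree above; the keyword lists, the `detect_vendor` function, and the `min_hits` parameter are assumptions, not the project's API.

```python
from typing import Optional

# Hypothetical keyword lists; real templates would define their own markers.
VENDOR_KEYWORDS = {
    "jobrad": ["jobrad"],
    "business_bike": ["businessbike", "business bike"],
    "deutsche_dienstrad": ["deutsche dienstrad"],
    "bls_bikeleasing": ["bikeleasing"],
}


def detect_vendor(page_text: str, min_hits: int = 1) -> Optional[str]:
    """Return the vendor key with the most keyword hits, or None if too few."""
    text = page_text.lower()
    scores = {
        vendor: sum(text.count(keyword) for keyword in keywords)
        for vendor, keywords in VENDOR_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Reject weak matches so unknown documents are not forced onto a template.
    return best if scores[best] >= min_hits else None


print(detect_vendor("JobRad GmbH, Rechnung 2024-0042"))
print(detect_vendor("Unrelated document"))
```

Returning `None` for unknown documents leaves room for a manual template choice instead of silently applying the wrong template.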

Processing Pipeline

PDF Input → Vendor Detection → Template Selection → Coordinate Extraction
    → Field Validation → Confidence Scoring → Manual Review (if low confidence)
    → Data Export (CSV)
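
The validation, scoring, and review steps of the pipeline can be sketched as follows. This is a simplified illustration under stated assumptions: the field names, the regular expressions, the equal weighting of completeness and format validity, and the `score_extraction` and `route` functions are all hypothetical. The only value taken from this README is the 0.8 automation threshold, and scores are assumed to lie in the range 0 to 1.

```python
import re

# Hypothetical required fields with format checks.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"^[A-Z0-9-]{4,}$"),
    "invoice_date": re.compile(r"^\d{2}\.\d{2}\.\d{4}$"),      # e.g. 31.01.2024
    "total_amount": re.compile(r"^\d{1,3}(\.\d{3})*,\d{2}$"),  # e.g. 1.234,56
}


def score_extraction(fields: dict[str, str]) -> float:
    """Average of completeness and format validity, each in [0, 1]."""
    present = [name for name in FIELD_PATTERNS if fields.get(name)]
    valid = [name for name in present if FIELD_PATTERNS[name].match(fields[name])]
    completeness = len(present) / len(FIELD_PATTERNS)
    validity = len(valid) / len(FIELD_PATTERNS)
    return (completeness + validity) / 2


def route(score: float, automation_threshold: float = 0.8) -> str:
    """Send confident results to export, everything else to manual review."""
    # Assumption: a score exactly equal to the threshold counts as accepted.
    return "export" if score >= automation_threshold else "manual_review"


complete = {"invoice_number": "2024-0042", "invoice_date": "31.01.2024",
            "total_amount": "1.234,56"}
partial = {"invoice_number": "2024-0042", "invoice_date": "31.01.2024"}
print(route(score_extraction(complete)))
print(route(score_extraction(partial)))
```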

Tech Stack

Component        Technology
Language         Python 3.11+
PDF Processing   pdfplumber, PyMuPDF (fitz)
Data Handling    pandas
GUI Framework    tkinter, Pillow, matplotlib
Localization     Babel
Logging          structlog
Testing          pytest, pytest-cov, pytest-mock, pytest-qt
Code Quality     mypy, black, isort, flake8, pylint
Security         bandit, safety
Documentation    Sphinx (Read the Docs theme)

Getting Started

Prerequisites

  • Python 3.11 or higher
  • 4 GB RAM minimum
  • 500 MB free disk space

Installation

# Clone the repository
git clone https://github.com/stalinrod/financial-document-parser.git
cd financial-document-parser

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate    # Linux/macOS
venv\Scripts\activate       # Windows (use instead of the line above)

# Install dependencies
pip install -r requirements.txt

Usage

Process a single PDF

python src/main.py invoice.pdf

Batch process a directory

python src/main.py --batch ./invoices/

Export results to CSV

python src/main.py invoice.pdf --export output.csv

Launch the desktop GUI

python src/main.py --gui

Sample Documents

This repository does not include sample PDF files. To test the parser, place your own PDF documents in a local directory and pass the path as an argument:

python src/main.py /path/to/your/documents/invoice.pdf

Programmatic usage

from src.core.app import PDFParserApp

app = PDFParserApp()
app.initialize()

result = app.process_pdf("path/to/invoice.pdf")

if result.success:
    print(f"Vendor: {result.vendor_type}")
    print(f"Confidence: {result.confidence_score:.2f}")
    print(f"Data: {result.extracted_data}")
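
Building on the result attributes shown above (`success`, `vendor_type`, `confidence_score`), the sketch below loops over several files and produces a CSV summary. The `summarize` and `rows_to_csv` helpers and the column names are hypothetical. The processing function is passed in as a parameter, so the sketch does not assume anything else about `PDFParserApp`.

```python
import csv
import io
from pathlib import Path
from typing import Callable, Iterable

COLUMNS = ["file", "success", "vendor", "confidence"]


def summarize(process_pdf: Callable[[str], object],
              pdf_paths: Iterable[Path]) -> list[dict]:
    """Run `process_pdf` on each path and collect one summary row per file."""
    rows = []
    for path in pdf_paths:
        result = process_pdf(str(path))
        rows.append({
            "file": path.name,
            "success": result.success,
            # Only read result details when extraction succeeded.
            "vendor": result.vendor_type if result.success else "",
            "confidence": f"{result.confidence_score:.2f}" if result.success else "",
        })
    return rows


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize summary rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Usage with the app from the example above (not executed here):
#   rows = summarize(app.process_pdf, sorted(Path("invoices").glob("*.pdf")))
#   Path("summary.csv").write_text(rows_to_csv(rows), encoding="utf-8")
```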

Configuration

Application settings are managed via settings.json:

{
  "language": "de",
  "theme": "dark",
  "automation_threshold": 0.8,
  "export_directory": "exports/",
  "log_level": "INFO"
}

Key parameters:

  • automation_threshold — Confidence score above which results are auto-accepted without manual review (default: 0.8)
  • theme — GUI theme (dark or light)
  • log_level — Logging verbosity (DEBUG, INFO, WARNING, ERROR)
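
For reference, one way to load and validate such a file is sketched below. The `load_settings` function is hypothetical and not part of the project. Apart from the documented 0.8 default for `automation_threshold`, the defaults are simply copied from the example above, and the 0 to 1 range check on the threshold is an assumption.

```python
import json
from pathlib import Path

# Defaults copied from the example settings.json; only the 0.8 threshold
# is documented as a default, the rest are assumptions.
DEFAULTS = {
    "language": "de",
    "theme": "dark",
    "automation_threshold": 0.8,
    "export_directory": "exports/",
    "log_level": "INFO",
}
VALID_THEMES = {"dark", "light"}
VALID_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


def load_settings(path: str = "settings.json") -> dict:
    """Merge settings from `path` over the defaults and validate key values."""
    settings = dict(DEFAULTS)
    file = Path(path)
    if file.exists():
        settings.update(json.loads(file.read_text(encoding="utf-8")))
    if settings["theme"] not in VALID_THEMES:
        raise ValueError(f"unknown theme: {settings['theme']!r}")
    if settings["log_level"] not in VALID_LOG_LEVELS:
        raise ValueError(f"unknown log_level: {settings['log_level']!r}")
    # Assumption: confidence scores, and therefore the threshold, lie in [0, 1].
    if not 0.0 <= settings["automation_threshold"] <= 1.0:
        raise ValueError("automation_threshold must be between 0 and 1")
    return settings
```

Failing early on an invalid value keeps a typo in `settings.json` from quietly changing which documents skip manual review.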

Testing

# Run full test suite
pytest

# Run with coverage report
pytest --cov=src --cov-report=html

# Run specific test module
pytest tests/integration/test_core_processing_engine.py -v

License

MIT License — see LICENSE for details.
