# Financial Document Parser
A production-grade desktop application for automated extraction, validation, and export of structured data from PDF financial documents. Features a coordinate-based extraction engine, intelligent vendor auto-detection, multi-level confidence scoring, batch processing, and a full desktop GUI.
Originally built for a German financial services client to process vendor invoices at scale. Designed with an extensible architecture to support additional document formats and locales.
- Coordinate-Based Extraction Engine — Spatially-aware field extraction using configurable coordinate zones per document template, not fragile regex patterns. Supports text, numeric, date, and currency field types with fallback zones and validation patterns.
- Automatic Vendor Detection — Multi-layered analysis (keyword matching, pattern recognition, document structure) to identify document source and select the correct extraction template automatically.
- Multi-Factor Confidence Scoring — Evaluates extraction quality based on data completeness, format validation, vendor-specific patterns, and consistency checks. Outputs a confidence level (low / medium / high) with field-level scores and actionable recommendations.
- Batch Processing — Multi-threaded processing engine with job queue management, memory monitoring, progress tracking, and comprehensive error recovery. Supports priority queuing and template overrides per job.
- Desktop GUI — Full tkinter interface with dark/light themes, PDF viewer, results display, batch queue viewer, template editor, coordinate zone editor, data quality dashboard, and export dialogs.
- WCAG Accessibility — Built-in accessibility enhancements and compliance for the desktop interface.
- Configurable Template System — Vendor-specific extraction templates with a visual template editor and template library manager. Easy to extend for new document formats.
- Data Export — CSV export with configurable formatting, export preview, and localized output.
- Production Infrastructure — Structured logging (structlog), error recovery, performance profiling, security scanning (bandit/safety), type checking (mypy), and a comprehensive test suite (pytest).
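To make the coordinate-zone idea concrete, here is a minimal sketch of zone-based extraction with a fallback zone. It is not the project's `coordinate_engine.py`: the `Zone` class, `extract_zone_text`, and all coordinates are hypothetical and chosen for illustration. The word dictionaries only mimic the shape returned by pdfplumber's `Page.extract_words()`, so the sketch runs without a PDF.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Zone:
    """Hypothetical rectangular zone in PDF points (not the project's real class)."""
    x0: float
    top: float
    x1: float
    bottom: float

    def contains(self, word: dict) -> bool:
        # A word belongs to the zone when its center point lies inside it.
        cx = (word["x0"] + word["x1"]) / 2
        cy = (word["top"] + word["bottom"]) / 2
        return self.x0 <= cx <= self.x1 and self.top <= cy <= self.bottom


def extract_zone_text(words: List[dict], zones: List[Zone]) -> Optional[str]:
    """Try the primary zone first, then each fallback zone in order."""
    for zone in zones:
        hits = [w for w in words if zone.contains(w)]
        if hits:
            # Reading order: top to bottom, then left to right.
            hits.sort(key=lambda w: (round(w["top"]), w["x0"]))
            return " ".join(w["text"] for w in hits)
    return None


# Illustrative word boxes, shaped like pdfplumber's extract_words() output.
words = [
    {"text": "Rechnungsnummer:", "x0": 50, "x1": 140, "top": 100, "bottom": 110},
    {"text": "RE-2024-001", "x0": 150, "x1": 210, "top": 100, "bottom": 110},
    {"text": "Gesamt", "x0": 50, "x1": 90, "top": 400, "bottom": 410},
]

# The primary zone is empty on this page, so the fallback zone supplies the value.
invoice_number_zones = [Zone(145, 300, 300, 320), Zone(145, 95, 300, 115)]
invoice_number = extract_zone_text(words, invoice_number_zones)
print(invoice_number)
```

Matching on word centers rather than full containment is a design choice in this sketch: it tolerates words that slightly overlap a zone boundary.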
```
src/
├── main.py # CLI entry point (single file, batch, or GUI mode)
├── production_app.py # Production application wrapper
├── core/ # Application lifecycle and orchestration
│ ├── app.py # Main app controller and processing pipeline
│ ├── config.py # Configuration management
│ ├── settings_manager.py # Runtime settings
│ ├── batch_processor_advanced.py # Multi-threaded batch engine
│ ├── batch_queue_manager.py # Job queue and priority management
│ ├── batch_progress_tracker.py # Progress monitoring
│ ├── batch_memory_manager.py # Memory usage optimization
│ ├── batch_error_handler.py # Error recovery strategies
│ ├── error_recovery.py # Global error recovery
│ ├── error_reporting.py # Structured error reports
│ └── exceptions.py # Custom exception hierarchy
├── processing/ # Core extraction pipeline
│ ├── coordinate_engine.py # Coordinate-based field extraction
│ ├── vendor_detector.py # Automatic vendor identification
│ ├── confidence_scorer.py # Multi-factor confidence analysis
│ ├── pdf_processor.py # PDF reading and page handling
│ ├── german_field_recognizer.py # Locale-specific field recognition
│ └── performance_optimizer.py # Processing performance tuning
├── templates/ # Vendor-specific extraction templates
│ ├── template_manager.py # Template loading and selection
│ ├── base_template.py # Abstract template interface
│ ├── jobrad_template.py
│ ├── business_bike_template.py
│ ├── deutsche_dienstrad_template.py
│ └── bls_bikeleasing_template.py
├── gui/ # Desktop interface (tkinter)
│ ├── main_window.py # Primary application window
│ ├── pdf_viewer.py # PDF document viewer
│ ├── results_viewer.py # Extraction results display
│ ├── batch_manager.py # Batch processing UI
│ ├── template_editor.py # Visual template configuration
│ ├── coordinate_zone_editor.py # Spatial zone editor
│ ├── data_quality_dashboard.py # Quality metrics dashboard
│ ├── export_manager.py # Export configuration and preview
│ ├── theme_manager.py # Dark/light theme system
│ └── wcag_accessibility_system.py # Accessibility compliance
├── data_management/ # Data persistence and export
├── localization/ # i18n and locale-specific formatting
└── utils/ # Logging, helpers, and shared utilities
```
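As a rough illustration of how the keyword-matching layer of vendor detection could map a document onto one of the four templates in `templates/`, consider the sketch below. The vendor keys follow the template file names, but the keyword lists, the `detect_vendor` function, and the `min_hits` parameter are assumptions made for this example. The real `vendor_detector.py` also uses pattern recognition and document structure, which are not shown.

```python
from typing import Optional

# Hypothetical keyword lists; the project's actual keywords are not documented here.
VENDOR_KEYWORDS = {
    "jobrad": ("jobrad",),
    "business_bike": ("businessbike", "business bike"),
    "deutsche_dienstrad": ("deutsche dienstrad",),
    "bls_bikeleasing": ("bikeleasing",),
}


def detect_vendor(page_text: str, min_hits: int = 1) -> Optional[str]:
    """Return the vendor with the most keyword hits, or None if nothing matches."""
    text = page_text.lower()
    scores = {
        vendor: sum(text.count(keyword) for keyword in keywords)
        for vendor, keywords in VENDOR_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Below the hit threshold the document is treated as an unknown vendor.
    return best if scores[best] >= min_hits else None


vendor = detect_vendor("Rechnung der JobRad GmbH, Freiburg")
template_file = f"{vendor}_template.py" if vendor else None
print(vendor, template_file)
```

Returning `None` for unknown documents leaves the decision to the caller, for example to fall back to manual template selection.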
```
PDF Input → Vendor Detection → Template Selection → Coordinate Extraction
→ Field Validation → Confidence Scoring → Manual Review (if low confidence)
→ Data Export (CSV)
```
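The confidence-scoring step in this pipeline can be sketched as a weighted combination of the four factors named in the feature list (data completeness, format validation, vendor-specific patterns, consistency checks). The weights and the low/medium/high cutoffs below are hypothetical; the actual values in `confidence_scorer.py` may differ.

```python
# Hypothetical weights; they sum to 1.0 so the score stays in [0, 1].
WEIGHTS = {
    "completeness": 0.4,
    "format_validation": 0.3,
    "vendor_patterns": 0.2,
    "consistency": 0.1,
}


def confidence_score(factors: dict) -> float:
    """Weighted average of per-factor scores, each expected in [0, 1]."""
    return sum(weight * factors[name] for name, weight in WEIGHTS.items())


def confidence_level(score: float) -> str:
    """Map a numeric score to a level; the cutoffs are assumptions."""
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


factors = {
    "completeness": 1.0,       # all expected fields were found
    "format_validation": 0.9,  # most fields match their validation pattern
    "vendor_patterns": 0.8,
    "consistency": 1.0,
}
score = confidence_score(factors)
level = confidence_level(score)
print(f"{score:.2f} -> {level}")
```

In this sketch, a document that lands in the "low" band would be the one routed to manual review before export.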
| Component | Technology |
|---|---|
| Language | Python 3.11+ |
| PDF Processing | pdfplumber, PyMuPDF (fitz) |
| Data Handling | pandas |
| GUI Framework | tkinter, Pillow, matplotlib |
| Localization | Babel |
| Logging | structlog |
| Testing | pytest, pytest-cov, pytest-mock, pytest-qt |
| Code Quality | mypy, black, isort, flake8, pylint |
| Security | bandit, safety |
| Documentation | Sphinx (Read the Docs theme) |
- Python 3.11 or higher
- 4 GB RAM minimum
- 500 MB free disk space
```bash
# Clone the repository
git clone https://github.com/stalinrod/financial-document-parser.git
cd financial-document-parser

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # Linux/macOS
venv\Scripts\activate # Windows

# Install dependencies
pip install -r requirements.txt
```

```bash
# Process a single PDF
python src/main.py invoice.pdf

# Process a directory of PDFs in batch mode
python src/main.py --batch ./invoices/

# Process a PDF and export the results to CSV
python src/main.py invoice.pdf --export output.csv

# Launch the desktop GUI
python src/main.py --gui
```

This repository does not include sample PDF files. To test the parser, place your own PDF documents in a local directory and pass the path as an argument:
```bash
python src/main.py /path/to/your/documents/invoice.pdf
```

The parser can also be used from Python:

```python
from src.core.app import PDFParserApp

app = PDFParserApp()
app.initialize()

result = app.process_pdf("path/to/invoice.pdf")
if result.success:
    print(f"Vendor: {result.vendor_type}")
    print(f"Confidence: {result.confidence_score:.2f}")
    print(f"Data: {result.extracted_data}")
```

Application settings are managed via `settings.json`:
```json
{
  "language": "de",
  "theme": "dark",
  "automation_threshold": 0.8,
  "export_directory": "exports/",
  "log_level": "INFO"
}
```

Key parameters:
- `automation_threshold`: Confidence score above which results are auto-accepted without manual review (default: 0.8)
- `theme`: GUI theme (`dark` or `light`)
- `log_level`: Logging verbosity (`DEBUG`, `INFO`, `WARNING`, `ERROR`)
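One way these settings could be loaded and applied is sketched below. The `load_settings` and `needs_manual_review` helpers are hypothetical and not part of the project's `settings_manager.py`. Only the `automation_threshold` default of 0.8 comes from this README; the other defaults simply mirror the example file above.

```python
import json
from pathlib import Path

# Assumed defaults: only automation_threshold (0.8) is documented as a default.
DEFAULTS = {
    "language": "de",
    "theme": "dark",
    "automation_threshold": 0.8,
    "export_directory": "exports/",
    "log_level": "INFO",
}


def load_settings(path: str = "settings.json") -> dict:
    """Merge values from the settings file over the defaults, if the file exists."""
    settings = dict(DEFAULTS)
    settings_file = Path(path)
    if settings_file.exists():
        settings.update(json.loads(settings_file.read_text(encoding="utf-8")))
    if not 0.0 <= settings["automation_threshold"] <= 1.0:
        raise ValueError("automation_threshold must be between 0 and 1")
    return settings


def needs_manual_review(confidence_score: float, settings: dict) -> bool:
    """Only scores strictly above the threshold are auto-accepted.

    Treating a score equal to the threshold as 'needs review' is an assumption.
    """
    return not confidence_score > settings["automation_threshold"]


settings = load_settings("settings.json")
print(needs_manual_review(0.75, settings))
```

Falling back to defaults when the file is missing keeps a fresh checkout usable without any configuration step.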
```bash
# Run full test suite
pytest

# Run with coverage report
pytest --cov=src --cov-report=html

# Run specific test module
pytest tests/integration/test_core_processing_engine.py -v
```

MIT License. See `LICENSE` for details.