Pdf validation #60

Merged
jangevaare merged 6 commits into refactor/single-client-pdfs from feat/pdf-validator
Nov 3, 2025

Conversation

@jangevaare
Member

- Replace PDF page counting with a more generic PDF validation step (the page count check is retained)
- Add an example of using invisible markers in Typst to support more specific PDF validation
- Make validation rules configurable (disabled/warn/error) in parameters.yaml
- Update documentation and tests
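
The disabled/warn/error levels could be applied roughly as follows. This is a minimal sketch, not the code in this PR: the rule names and levels come from the pipeline output further down, while the dictionary layout, the `apply_rules` helper and the default for unknown rules are assumptions made for illustration.

```python
# Hypothetical shape of the validation rules after parameters.yaml has been parsed.
# Rule names and levels mirror the pipeline output in this PR; the layout is assumed.
RULES = {
    "client_id_presence": "error",
    "envelope_window_1_125": "warn",
    "exactly_two_pages": "warn",
    "signature_overflow": "warn",
}

VALID_LEVELS = {"disabled", "warn", "error"}


def apply_rules(rules, failed_rules):
    """Split failed rule names into (warnings, errors) by configured level."""
    warnings, errors = [], []
    for name in failed_rules:
        level = rules.get(name, "warn")  # assumption: unknown rules default to warn
        if level not in VALID_LEVELS:
            raise ValueError(f"Unknown level {level!r} for rule {name!r}")
        if level == "disabled":
            continue  # disabled rules are ignored entirely
        if level == "error":
            errors.append(name)
        else:
            warnings.append(name)
    return warnings, errors


if __name__ == "__main__":
    # Turn one rule off without touching the others.
    rules = {**RULES, "signature_overflow": "disabled"}
    print(apply_rules(rules, ["exactly_two_pages", "signature_overflow"]))
```

A caller could then stop the pipeline when the errors list is non-empty and only report the warnings.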

Contributor

@kassyray kassyray left a comment

I like the changes. I think they seem super useful.

Something that I had in a past version was counting the total number of PDFs in a batch and validating that against the number of clients that should be there.

Thoughts on whether to add this to validate_pdfs or elsewhere?

The JSON report is a nice addition. Whether through the log (once implemented) or elsewhere, we can alert on the CLI when something goes wrong.

@jangevaare
Member Author

jangevaare commented Oct 30, 2025

Super.

I think that check should live in the manifests created by the batch_pdfs.py step. (As I write this out, I'm realizing that batch_pdfs.py is maybe not the best name for it; maybe bundle_pdfs.py.)

Example top of batch manifest below:

{
  "run_id": "20251030T135540",
  "language": "fr",
  "batch_type": "size_based",
  "batch_identifier": null,
  "batch_number": 1,
  "total_batches": 1,
  "batch_size": 100,
  "total_clients": 5,
  "total_pages": 15,
  "sha256": "530f879d3186cd97b4ca5e25425ec8da63d59a1358c129951e114648d5e40989",
  "output_pdf": "pdf_combined/fr_batch_001_of_001.pdf",
  "clients": [
    {
      "sequence": "00001",
      "client_id": "1009876545",
      "full_name": "Scurry Nutcracker",
      "school": "Burrow Public School",
      "board": "",
      "pdf_path": "pdf_individual/fr_notice_00001_1009876545.pdf",
      "artifact_path": "artifacts/preprocessed_clients_20251030T135540.json",
      "pages": 3
    },
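
A consistency check along the lines discussed above could read these manifest fields directly. The sketch below is only an illustration: `check_manifest` is a hypothetical helper, and it assumes the caller already knows which client IDs belong in the batch. The field names (`total_clients`, `total_pages`, `clients`, `client_id`, `pages`) are the ones shown in the manifest excerpt.

```python
def check_manifest(manifest, expected_client_ids):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    clients = manifest.get("clients", [])

    # The declared client count should match the number of client entries.
    if manifest.get("total_clients") != len(clients):
        problems.append(
            f"total_clients is {manifest.get('total_clients')} "
            f"but {len(clients)} client entries are listed"
        )

    # The declared page total should match the sum of per-client pages.
    page_sum = sum(c.get("pages", 0) for c in clients)
    if manifest.get("total_pages") != page_sum:
        problems.append(
            f"total_pages is {manifest.get('total_pages')} "
            f"but client pages sum to {page_sum}"
        )

    # Every expected client should appear, and nobody else.
    listed = {c.get("client_id") for c in clients}
    expected = set(expected_client_ids)
    for cid in sorted(expected - listed):
        problems.append(f"client {cid} is missing from the batch")
    for cid in sorted(listed - expected):
        problems.append(f"client {cid} was not expected in the batch")

    return problems


if __name__ == "__main__":
    # Small made-up manifest with a single client, for illustration only.
    manifest = {
        "total_clients": 1,
        "total_pages": 3,
        "clients": [{"client_id": "1009876545", "pages": 3}],
    }
    print(check_manifest(manifest, ["1009876545"]))
```

Returning a list of problems rather than raising keeps the check usable under both warn and error levels.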

@jangevaare
Member Author

And the top of the validation output...

{
  "language": "fr",
  "total_pdfs": 5,
  "passed_count": 0,
  "warning_count": 5,
  "page_count_distribution": {
    "3": 5
  },
  "warning_types": {
    "exactly_two_pages": 5,
    "signature_overflow": 5
  },
  "results": [
    {
      "filename": "fr_notice_00001_1009876545.pdf",
      "page_count": 3,
      "warnings": [
        "exactly_two_pages: 3 pages (expected 2)",
        "signature_overflow: Signature block found on page 2 (expected page 1)"
      ],
      "passed": false
    },
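
To alert on the CLI from a report like this, the per-rule counts can be recomputed from the `results` entries. This is a rough sketch under the assumption that every warning string starts with the rule name followed by a colon, as in the excerpt above; `recount_warning_types` and `failed_filenames` are hypothetical helpers, not functions from this PR.

```python
from collections import Counter


def recount_warning_types(results):
    """Count warnings per rule name, taking the text before the first colon as the rule."""
    counts = Counter()
    for result in results:
        for warning in result.get("warnings", []):
            rule = warning.split(":", 1)[0].strip()
            counts[rule] += 1
    return dict(counts)


def failed_filenames(results):
    """List the filenames of PDFs whose 'passed' flag is false."""
    return [r["filename"] for r in results if not r.get("passed", False)]


if __name__ == "__main__":
    # The single result entry shown in the validation output above.
    results = [
        {
            "filename": "fr_notice_00001_1009876545.pdf",
            "page_count": 3,
            "warnings": [
                "exactly_two_pages: 3 pages (expected 2)",
                "signature_overflow: Signature block found on page 2 (expected page 1)",
            ],
            "passed": False,
        }
    ]
    print(recount_warning_types(results))
    print(failed_filenames(results))
```

Comparing the recomputed counts with the report's `warning_types` field would also catch a report whose summary and details disagree.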

@jangevaare
Member Author

jangevaare commented Nov 3, 2025

I added an example that uses measurement-based validation (contact info in the envelope window), as well as one that uses regex-based validation (client ID).
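
For readers unfamiliar with the regex-based idea, here is a minimal sketch of a client ID presence check. It is not the implementation in this PR: it assumes the text of the PDF has already been extracted elsewhere, that client IDs are ten digits long (as in the example filenames), and that the ID is the last underscore-separated part of the filename. Both helper names are hypothetical.

```python
import re

# Assumption: filenames end in _<ten-digit client id>.pdf, as in
# fr_notice_00001_1009876545.pdf from the examples in this PR.
CLIENT_ID_IN_NAME = re.compile(r"_(\d{10})\.pdf$")


def client_id_from_filename(filename):
    """Return the client ID embedded in the filename, or None if there is none."""
    match = CLIENT_ID_IN_NAME.search(filename)
    return match.group(1) if match else None


def client_id_present(filename, extracted_text):
    """Check that the filename's client ID appears in the extracted PDF text.

    The lookarounds reject matches that are part of a longer run of digits.
    """
    client_id = client_id_from_filename(filename)
    if client_id is None:
        return False
    pattern = rf"(?<!\d){re.escape(client_id)}(?!\d)"
    return re.search(pattern, extracted_text) is not None


if __name__ == "__main__":
    name = "fr_notice_00001_1009876545.pdf"
    print(client_id_present(name, "Client ID: 1009876545"))
    print(client_id_present(name, "Client ID: 21009876545"))
```

Treating a filename without a recognizable ID as a failure is a design choice here; a real rule might report that case separately.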

I'm going to merge this back into the main PR.

🚀 Starting VIPER Pipeline
🗂️  Input File: rodent_dataset.xlsx


============================================================
Step 1: Preparing output directory
============================================================
✅ Step 1: Output directory prepared complete in 0.0 seconds.

============================================================
Step 2: Preprocessing
============================================================
📄 Preprocessed artifact: /home/jovyan/pr-cleanup/output/artifacts/preprocessed_clients_20251103T163011.json
Preprocess log written to /home/jovyan/pr-cleanup/output/logs/preprocess_20251103T163011.log
Warnings detected during preprocessing:
 - Missing board name for: Burrow Public School, Cheese Wheel Academy, Nutcracker Academy, Tunnel Academy, Whisker Elementary
👥 Clients normalized: 5
✅ Step 2: Preprocessing complete in 0.2 seconds.

============================================================
Step 3: Generating QR codes
============================================================
Generated 5 QR code PNG file(s) in /home/jovyan/pr-cleanup/output/artifacts/qr_codes/
✅ Step 3: QR code generation complete in 0.1 seconds.

============================================================
Step 4: Generating Typst templates
============================================================
Generated 5 Typst files in /home/jovyan/pr-cleanup/output/artifacts for language fr
Generated 5 Typst files in /home/jovyan/pr-cleanup/output/artifacts
✅ Step 4: Template generation complete in 0.0 seconds.

============================================================
Step 5: Compiling Typst templates
============================================================
Compiled 5 Typst file(s) to PDFs in /home/jovyan/pr-cleanup/output/pdf_individual.
✅ Step 5: Compilation complete in 4.5 seconds.

============================================================
Step 6: Validating compiled PDFs
============================================================
Validation rules:
  - client_id_presence [error]: ✓ 5 passed
  - envelope_window_1_125 [warn]: ✓ 5 passed
  - exactly_two_pages [warn]: ✓ 0 passed, ✗ 5 PDFs failed
  - signature_overflow [warn]: ✓ 0 passed, ✗ 5 PDFs failed

Detailed validation results: output/metadata/fr_validation_20251103T163011.json
✅ Step 6: PDF validation complete in 0.2 seconds.

============================================================
Step 8: Batching PDFs
============================================================
Created 1 batches in /home/jovyan/pr-cleanup/output/pdf_combined
✅ Step 8: Batching complete in 0.1 seconds.

============================================================
Step 9: Cleanup
============================================================
Cleanup skipped (keep_intermediate_files enabled).

🎉 Pipeline completed successfully!
🕒 Time Summary:
  - Output Preparation        0.0s
  - Preprocessing             0.2s
  - QR Code Generation        0.1s
  - Template Generation       0.0s
  - Template Compilation      4.5s
  - PDF Validation            0.2s
  - PDF Batching              0.1s
  - ───────────────────────── ──────
  - Total Time                5.1s

📦 Batch size:             100
🏷️  Batch scope:            Sequential
👥 Clients processed:      5
🧹 Cleanup:                Skipped

@jangevaare jangevaare merged commit 69a93d9 into refactor/single-client-pdfs Nov 3, 2025
@kassyray kassyray deleted the feat/pdf-validator branch November 6, 2025 18:55