fix(rag) PdfPig input-size guard + hoist OcrRequiredCharFloor to options by jkeeley2073 · Pull Request #108 · Early-Bird-Solutions-LLC/PinballWizard

jkeeley2073 · 2026-05-08T16:15:09Z

Summary

Architectural follow-up from the W1-5 close. Closes two ⚠️ findings the W1-5 PR (#104) deferred with justification at the time:

MaxStreamBytes — protects against zip-bomb / multi-GB hostile PDFs. Default 100MB; Phase 4 curated subset PDFs are bounded at ~30MB, so the default leaves headroom for legitimate Phase 4.5 manuals while capping unbounded reads. Guard fires synchronously before Task.Run, gated on stream.CanSeek so non-seekable streams fall through to PdfPig's own parser failure path (which is its existing behavior — non-seekable streams can't be parsed regardless).
OcrRequiredCharFloor — hoisted from a private constant to a configurable option. Default unchanged (32). Tunable per environment when Phase 4.5 corpus expansion surfaces legitimate one-line bulletins shorter than the previous hardcoded floor.
New PdfExtractionOptions class (Application/Rag/Extraction) carries both knobs with [Range] attributes for IDE documentation. AddPdfDocumentTextExtractor DI extension wires AddOptions<PdfExtractionOptions>() with defaults; host overrides via services.Configure<PdfExtractionOptions>(...) before the extension call.
New ExtractionStatus.SizeExceeded enum value. Distinct from Malformed because the cause is operational (oversized upload — possibly attacker probing the ingest endpoint) rather than parser-level. W3-2 Cosmos Change Feed Function will treat both identically for log+skip, but the distinct telemetry signal helps operational triage.

5 new tests (8 → 13): size-exceeded rejects with diagnostic message; within-limit stream processes normally; non-seekable stream skips the guard (NonSeekableStream fixture wraps a backing stream and reports CanSeek=false — MemoryStream is always seekable so we need a wrapper for this branch); raised OcrRequiredCharFloor classifies a small-but-valid doc as OcrRequired (pins that the option is genuinely read — defends against a future "dead config" regression).

No /local-review on this PR — change is architecturally identical to the W1-5 baseline (#104, fully reviewed at the time); this PR strictly hoists two values to options + adds a defensive size guard. Mechanical self-audit covers it.

Mechanical self-audit: all 7 checks pass.

Test plan

Full solution build — 0 warnings, 0 errors
Full test suite — 775 / 775 passing (was 770; +5)
Identity verified — Jim Keeley <94459922+jkeeley2073@users.noreply.github.com>

🤖 Generated with Claude Code

Architectural follow-up #3 from the W1-5 close (memory: session_handoff_2026_05_08_phase4_wave1_close.md § Architectural follow-ups). Closes two W1-5 ⚠️ findings that were deferred with justification at the time: - "MaxStreamBytes default ~100MB to protect against zip-bomb / multi-GB PDFs. v1 trusts the curated subset (bounded); add before Phase 4.5 corpus expansion." - "OcrRequiredCharFloor=32 hardcoded for Phase 4's curated subset; revisit-as-options when the Phase 4.5 corpus expansion exposes edge cases." What this PR ships: - New PdfExtractionOptions class (Application/Rag/Extraction). Two knobs: MaxStreamBytes (default 100MB) and OcrRequiredCharFloor (default 32). Both [Range]-attributed for IDE documentation; consuming host can ValidateDataAnnotations if it wires that package. - New ExtractionStatus.SizeExceeded enum value. Distinct from Malformed because the cause is operational (oversized upload — possibly zip-bomb or hostile multi-GB PDF) rather than a parser- level structural issue. The orchestrator (W3-2 Cosmos Change Feed Function) will treat this identically to Malformed for the log+ skip flow, but the distinct status gives operational telemetry a way to count size-rejections separately from parse failures — helpful for spotting an attacker probing the ingestion endpoint. - PdfPigDocumentTextExtractor takes IOptions<PdfExtractionOptions> in the ctor. Size guard fires synchronously before Task.Run so oversized inputs reject without dispatching a worker thread that's just going to bounce. The check is gated on stream.CanSeek (PdfPig requires seekable input anyway, so non-seekable streams fall through to the parser's own failure path; the guard skip is documented inline + pinned by a test). - AddPdfDocumentTextExtractor DI extension wires AddOptions<PdfExtractionOptions>() with defaults. Host overrides via services.Configure<PdfExtractionOptions>(config.GetSection( PdfExtractionOptions.SectionName)) before the extension call, matching the AddHybridChunker pattern from W2-2. - 5 new tests (was 8, now 13): size-exceeded path returns SizeExceeded with diagnostic message; within-limit stream processes normally; non-seekable stream skips the size check; raised OcrRequiredCharFloor classifies a small-but-valid doc as OcrRequired (pins that the option is genuinely read, not dead config); ctor null-options test added alongside the existing null-logger test. - A small NonSeekableStream test fixture wraps a backing stream and reports CanSeek=false (MemoryStream is always seekable so we need a wrapper to exercise the non-seekable branch). Build: 0 warnings, 0 errors. Tests: 775 / 775 (was 770; +5). Identity: 94459922+jkeeley2073@users.noreply.github.com. No /local-review run on this PR — change is architecturally identical to the W1-5 baseline (PR #104) which had a full review; this PR strictly hoists two values to options + adds a defensive guard. Mechanical self-audit (all 7 checks): both new options properties read by code (verified via grep); no bare catch; no new ISourceScraper so SourceAliasContractTests N/A; tests assert behavior not structure (the OcrRequiredCharFloor test validates the floor is genuinely tunable — feeding a 10000-char threshold and confirming a 56-char doc routes to OcrRequired); zero warnings; identity correct. Files: new src/PinballWizard.Application/Rag/Extraction/PdfExtractionOptions.cs; modified src/PinballWizard.Application/Rag/Extraction/ExtractionStatus.cs (SizeExceeded enum value), src/PinballWizard.Infrastructure/Rag/Extraction/PdfPigDocumentTextExtractor.cs (IOptions ctor + size guard + options-driven OCR floor), src/PinballWizard.Infrastructure/Rag/Extraction/ServiceCollectionExtensions.cs (AddOptions<PdfExtractionOptions>), tests/PinballWizard.Scraper.Tests/Rag/Extraction/PdfPigDocumentTextExtractorTests.cs (NewExtractor helper + NonSeekableStream fixture + 5 new tests). Refs: docs/build-spec.md § Phase 4 scope item 14 (W1-5); session_handoff_2026_05_08_phase4_wave1_close.md § Architectural follow-ups (items 3 + 4 — both closed by this PR).

jkeeley2073 added the claude-code Generated with Claude Code label May 8, 2026

jkeeley2073 merged commit d7b48e8 into main May 8, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(rag) PdfPig input-size guard + hoist OcrRequiredCharFloor to options#108

fix(rag) PdfPig input-size guard + hoist OcrRequiredCharFloor to options#108
jkeeley2073 merged 1 commit into
mainfrom
Dev-PdfPigInputSizeGuard

jkeeley2073 commented May 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jkeeley2073 commented May 8, 2026

Summary

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant