fix(rag) PdfPig input-size guard + hoist OcrRequiredCharFloor to options#108
Merged
Conversation
Architectural follow-up #3 from the W1-5 close (memory: session_handoff_2026_05_08_phase4_wave1_close.md § Architectural follow-ups). Closes two W1-5⚠️ findings that were deferred with justification at the time: - "MaxStreamBytes default ~100MB to protect against zip-bomb / multi-GB PDFs. v1 trusts the curated subset (bounded); add before Phase 4.5 corpus expansion." - "OcrRequiredCharFloor=32 hardcoded for Phase 4's curated subset; revisit-as-options when the Phase 4.5 corpus expansion exposes edge cases." What this PR ships: - New PdfExtractionOptions class (Application/Rag/Extraction). Two knobs: MaxStreamBytes (default 100MB) and OcrRequiredCharFloor (default 32). Both [Range]-attributed for IDE documentation; consuming host can ValidateDataAnnotations if it wires that package. - New ExtractionStatus.SizeExceeded enum value. Distinct from Malformed because the cause is operational (oversized upload — possibly zip-bomb or hostile multi-GB PDF) rather than a parser- level structural issue. The orchestrator (W3-2 Cosmos Change Feed Function) will treat this identically to Malformed for the log+ skip flow, but the distinct status gives operational telemetry a way to count size-rejections separately from parse failures — helpful for spotting an attacker probing the ingestion endpoint. - PdfPigDocumentTextExtractor takes IOptions<PdfExtractionOptions> in the ctor. Size guard fires synchronously before Task.Run so oversized inputs reject without dispatching a worker thread that's just going to bounce. The check is gated on stream.CanSeek (PdfPig requires seekable input anyway, so non-seekable streams fall through to the parser's own failure path; the guard skip is documented inline + pinned by a test). - AddPdfDocumentTextExtractor DI extension wires AddOptions<PdfExtractionOptions>() with defaults. Host overrides via services.Configure<PdfExtractionOptions>(config.GetSection( PdfExtractionOptions.SectionName)) before the extension call, matching the AddHybridChunker pattern from W2-2. - 5 new tests (was 8, now 13): size-exceeded path returns SizeExceeded with diagnostic message; within-limit stream processes normally; non-seekable stream skips the size check; raised OcrRequiredCharFloor classifies a small-but-valid doc as OcrRequired (pins that the option is genuinely read, not dead config); ctor null-options test added alongside the existing null-logger test. - A small NonSeekableStream test fixture wraps a backing stream and reports CanSeek=false (MemoryStream is always seekable so we need a wrapper to exercise the non-seekable branch). Build: 0 warnings, 0 errors. Tests: 775 / 775 (was 770; +5). Identity: 94459922+jkeeley2073@users.noreply.github.com. No /local-review run on this PR — change is architecturally identical to the W1-5 baseline (PR #104) which had a full review; this PR strictly hoists two values to options + adds a defensive guard. Mechanical self-audit (all 7 checks): both new options properties read by code (verified via grep); no bare catch; no new ISourceScraper so SourceAliasContractTests N/A; tests assert behavior not structure (the OcrRequiredCharFloor test validates the floor is genuinely tunable — feeding a 10000-char threshold and confirming a 56-char doc routes to OcrRequired); zero warnings; identity correct. Files: new src/PinballWizard.Application/Rag/Extraction/PdfExtractionOptions.cs; modified src/PinballWizard.Application/Rag/Extraction/ExtractionStatus.cs (SizeExceeded enum value), src/PinballWizard.Infrastructure/Rag/Extraction/PdfPigDocumentTextExtractor.cs (IOptions ctor + size guard + options-driven OCR floor), src/PinballWizard.Infrastructure/Rag/Extraction/ServiceCollectionExtensions.cs (AddOptions<PdfExtractionOptions>), tests/PinballWizard.Scraper.Tests/Rag/Extraction/PdfPigDocumentTextExtractorTests.cs (NewExtractor helper + NonSeekableStream fixture + 5 new tests). Refs: docs/build-spec.md § Phase 4 scope item 14 (W1-5); session_handoff_2026_05_08_phase4_wave1_close.md § Architectural follow-ups (items 3 + 4 — both closed by this PR).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Architectural follow-up from the W1-5 close. Closes two⚠️ findings the W1-5 PR (#104) deferred with justification at the time:
MaxStreamBytes— protects against zip-bomb / multi-GB hostile PDFs. Default 100MB; Phase 4 curated subset PDFs are bounded at ~30MB, so the default leaves headroom for legitimate Phase 4.5 manuals while capping unbounded reads. Guard fires synchronously beforeTask.Run, gated onstream.CanSeekso non-seekable streams fall through to PdfPig's own parser failure path (which is its existing behavior — non-seekable streams can't be parsed regardless).OcrRequiredCharFloor— hoisted from a private constant to a configurable option. Default unchanged (32). Tunable per environment when Phase 4.5 corpus expansion surfaces legitimate one-line bulletins shorter than the previous hardcoded floor.New
PdfExtractionOptionsclass (Application/Rag/Extraction) carries both knobs with[Range]attributes for IDE documentation.AddPdfDocumentTextExtractorDI extension wiresAddOptions<PdfExtractionOptions>()with defaults; host overrides viaservices.Configure<PdfExtractionOptions>(...)before the extension call.New
ExtractionStatus.SizeExceededenum value. Distinct fromMalformedbecause the cause is operational (oversized upload — possibly attacker probing the ingest endpoint) rather than parser-level. W3-2 Cosmos Change Feed Function will treat both identically for log+skip, but the distinct telemetry signal helps operational triage.5 new tests (8 → 13): size-exceeded rejects with diagnostic message; within-limit stream processes normally; non-seekable stream skips the guard (
NonSeekableStreamfixture wraps a backing stream and reportsCanSeek=false—MemoryStreamis always seekable so we need a wrapper for this branch); raisedOcrRequiredCharFloorclassifies a small-but-valid doc asOcrRequired(pins that the option is genuinely read — defends against a future "dead config" regression).No
/local-reviewon this PR — change is architecturally identical to the W1-5 baseline (#104, fully reviewed at the time); this PR strictly hoists two values to options + adds a defensive size guard. Mechanical self-audit covers it.Mechanical self-audit: all 7 checks pass.
Test plan
Jim Keeley <94459922+jkeeley2073@users.noreply.github.com>🤖 Generated with Claude Code