Skip to content

fix(rag) PdfPig input-size guard + hoist OcrRequiredCharFloor to options#108

Merged
jkeeley2073 merged 1 commit into
mainfrom
Dev-PdfPigInputSizeGuard
May 8, 2026
Merged

fix(rag) PdfPig input-size guard + hoist OcrRequiredCharFloor to options#108
jkeeley2073 merged 1 commit into
mainfrom
Dev-PdfPigInputSizeGuard

Conversation

@jkeeley2073
Copy link
Copy Markdown
Contributor

Summary

Architectural follow-up from the W1-5 close. Closes two ⚠️ findings the W1-5 PR (#104) deferred with justification at the time:

  • MaxStreamBytes — protects against zip-bomb / multi-GB hostile PDFs. Default 100MB; Phase 4 curated subset PDFs are bounded at ~30MB, so the default leaves headroom for legitimate Phase 4.5 manuals while capping unbounded reads. Guard fires synchronously before Task.Run, gated on stream.CanSeek so non-seekable streams fall through to PdfPig's own parser failure path (which is its existing behavior — non-seekable streams can't be parsed regardless).

  • OcrRequiredCharFloor — hoisted from a private constant to a configurable option. Default unchanged (32). Tunable per environment when Phase 4.5 corpus expansion surfaces legitimate one-line bulletins shorter than the previous hardcoded floor.

  • New PdfExtractionOptions class (Application/Rag/Extraction) carries both knobs with [Range] attributes for IDE documentation. AddPdfDocumentTextExtractor DI extension wires AddOptions<PdfExtractionOptions>() with defaults; host overrides via services.Configure<PdfExtractionOptions>(...) before the extension call.

  • New ExtractionStatus.SizeExceeded enum value. Distinct from Malformed because the cause is operational (oversized upload — possibly attacker probing the ingest endpoint) rather than parser-level. W3-2 Cosmos Change Feed Function will treat both identically for log+skip, but the distinct telemetry signal helps operational triage.

5 new tests (8 → 13): size-exceeded rejects with diagnostic message; within-limit stream processes normally; non-seekable stream skips the guard (NonSeekableStream fixture wraps a backing stream and reports CanSeek=falseMemoryStream is always seekable so we need a wrapper for this branch); raised OcrRequiredCharFloor classifies a small-but-valid doc as OcrRequired (pins that the option is genuinely read — defends against a future "dead config" regression).

No /local-review on this PR — change is architecturally identical to the W1-5 baseline (#104, fully reviewed at the time); this PR strictly hoists two values to options + adds a defensive size guard. Mechanical self-audit covers it.

Mechanical self-audit: all 7 checks pass.

Test plan

  • Full solution build — 0 warnings, 0 errors
  • Full test suite — 775 / 775 passing (was 770; +5)
  • Identity verified — Jim Keeley <94459922+jkeeley2073@users.noreply.github.com>

🤖 Generated with Claude Code

Architectural follow-up #3 from the W1-5 close (memory:
session_handoff_2026_05_08_phase4_wave1_close.md § Architectural
follow-ups). Closes two W1-5 ⚠️ findings that were deferred with
justification at the time:

  - "MaxStreamBytes default ~100MB to protect against zip-bomb /
    multi-GB PDFs. v1 trusts the curated subset (bounded); add before
    Phase 4.5 corpus expansion."
  - "OcrRequiredCharFloor=32 hardcoded for Phase 4's curated subset;
    revisit-as-options when the Phase 4.5 corpus expansion exposes
    edge cases."

What this PR ships:

- New PdfExtractionOptions class (Application/Rag/Extraction). Two
  knobs: MaxStreamBytes (default 100MB) and OcrRequiredCharFloor
  (default 32). Both [Range]-attributed for IDE documentation;
  consuming host can ValidateDataAnnotations if it wires that
  package.

- New ExtractionStatus.SizeExceeded enum value. Distinct from
  Malformed because the cause is operational (oversized upload —
  possibly zip-bomb or hostile multi-GB PDF) rather than a parser-
  level structural issue. The orchestrator (W3-2 Cosmos Change Feed
  Function) will treat this identically to Malformed for the log+
  skip flow, but the distinct status gives operational telemetry a
  way to count size-rejections separately from parse failures —
  helpful for spotting an attacker probing the ingestion endpoint.

- PdfPigDocumentTextExtractor takes IOptions<PdfExtractionOptions>
  in the ctor. Size guard fires synchronously before Task.Run so
  oversized inputs reject without dispatching a worker thread that's
  just going to bounce. The check is gated on stream.CanSeek
  (PdfPig requires seekable input anyway, so non-seekable streams
  fall through to the parser's own failure path; the guard skip
  is documented inline + pinned by a test).

- AddPdfDocumentTextExtractor DI extension wires
  AddOptions<PdfExtractionOptions>() with defaults. Host overrides
  via services.Configure<PdfExtractionOptions>(config.GetSection(
  PdfExtractionOptions.SectionName)) before the extension call,
  matching the AddHybridChunker pattern from W2-2.

- 5 new tests (was 8, now 13): size-exceeded path returns
  SizeExceeded with diagnostic message; within-limit stream
  processes normally; non-seekable stream skips the size check;
  raised OcrRequiredCharFloor classifies a small-but-valid doc as
  OcrRequired (pins that the option is genuinely read, not dead
  config); ctor null-options test added alongside the existing
  null-logger test.

- A small NonSeekableStream test fixture wraps a backing stream
  and reports CanSeek=false (MemoryStream is always seekable so we
  need a wrapper to exercise the non-seekable branch).

Build: 0 warnings, 0 errors. Tests: 775 / 775 (was 770; +5).
Identity: 94459922+jkeeley2073@users.noreply.github.com.

No /local-review run on this PR — change is architecturally
identical to the W1-5 baseline (PR #104) which had a full review;
this PR strictly hoists two values to options + adds a defensive
guard. Mechanical self-audit (all 7 checks): both new options
properties read by code (verified via grep); no bare catch; no new
ISourceScraper so SourceAliasContractTests N/A; tests assert
behavior not structure (the OcrRequiredCharFloor test validates the
floor is genuinely tunable — feeding a 10000-char threshold and
confirming a 56-char doc routes to OcrRequired); zero warnings;
identity correct.

Files: new
  src/PinballWizard.Application/Rag/Extraction/PdfExtractionOptions.cs;
modified
  src/PinballWizard.Application/Rag/Extraction/ExtractionStatus.cs
    (SizeExceeded enum value),
  src/PinballWizard.Infrastructure/Rag/Extraction/PdfPigDocumentTextExtractor.cs
    (IOptions ctor + size guard + options-driven OCR floor),
  src/PinballWizard.Infrastructure/Rag/Extraction/ServiceCollectionExtensions.cs
    (AddOptions<PdfExtractionOptions>),
  tests/PinballWizard.Scraper.Tests/Rag/Extraction/PdfPigDocumentTextExtractorTests.cs
    (NewExtractor helper + NonSeekableStream fixture + 5 new tests).

Refs: docs/build-spec.md § Phase 4 scope item 14 (W1-5);
session_handoff_2026_05_08_phase4_wave1_close.md § Architectural
follow-ups (items 3 + 4 — both closed by this PR).
@jkeeley2073 jkeeley2073 added the claude-code Generated with Claude Code label May 8, 2026
@jkeeley2073 jkeeley2073 merged commit d7b48e8 into main May 8, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claude-code Generated with Claude Code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant