feat(rag) Phase 4 W1-5 — PdfPig wrapper + text-extraction service by jkeeley2073 · Pull Request #104 · Early-Bird-Solutions-LLC/PinballWizard

jkeeley2073 · 2026-05-08T15:04:09Z

Summary

Implements Phase 4 W1-5 per docs/build-spec.md § Phase 4 scope item 14. Purely unit-testable; no Cosmos / AI Search deps. Sequenced before W2-2 (hybrid chunker per ADR-0019), which consumes IDocumentTextExtractor for per-page text + outline.

This is the second-to-last Wave 1 PR (only the W1-4 H1 operational hand-off remains, which is operator-side, not a PR).

What ships

IDocumentTextExtractor abstraction in Application/Rag/Extraction/. ExtractAsync(Stream, CancellationToken) returns ExtractedDocument with per-page text, an outline list, and an ExtractionStatus enum (Success / OcrRequired / Encrypted / Malformed). Failure modes surface as Status values rather than exceptions — the W3-2 Cosmos Change Feed Function can log+skip without try/catch.
ExtractedDocument record with ExtractedPage (PageNumber + Text) and OutlineEntry (Title + PageNumber + Level) nested records. Per-page PageNumber preserved from PdfPig's 1-based page.Number for ADR-0019's page-anchor citations + ADR-0021's page_start/page_end index fields.
PdfPigDocumentTextExtractor in Infrastructure/Rag/Extraction/. Wraps UglyToad.PdfPig 0.1.14 (NuGet package PdfPig). Single try/catch wraps PdfDocument.Open AND the entire using (document) body — page enumeration, page.Text access, TryGetBookmarks. PdfPig is known to throw mid-stream on malformed-but-openable PDFs; the wider catch keeps the structured-result-on-failure contract intact.
AddPdfDocumentTextExtractor DI extension (Singleton; extractor is stateless + thread-safe).
PdfPig 0.1.14 NuGet package added to the Infrastructure csproj. Pure-managed (no native deps); compatible with the showcase posture (no GPL/AGPL licensing, unlike iText 7 Community).
8 unit tests with programmatic fixture PDFs generated via PdfPig's own writer — no committed binary blobs. Coverage: ctor null-arg guard, null stream, malformed bytes, empty stream, success-with-text, multi-page PageNumber preservation, no-text PDF → OcrRequired heuristic, cancellation propagation.
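
As a rough illustration of the contract described in the list above, the types might be shaped as follows. The type names, the four Status values, and the PageNumber / Text / Title / Level members come from the description; the `Outline` property name, the `Task<ExtractedDocument>` return type, the `Failure` helper, and the flat (non-nested) record layout are assumptions made for this sketch, not the shipped code.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Outcome of an extraction attempt; the four values are those named in the PR.
public enum ExtractionStatus { Success, OcrRequired, Encrypted, Malformed }

// One page of extracted text. PageNumber is 1-based, mirroring PdfPig's page.Number.
public sealed record ExtractedPage(int PageNumber, string Text);

// One bookmark entry from the PDF outline.
public sealed record OutlineEntry(string Title, int PageNumber, int Level);

// Structured result: failures are reported through Status (plus an Error string),
// not through exceptions. "Outline" and "Failure" are hypothetical names.
public sealed record ExtractedDocument(
    ExtractionStatus Status,
    IReadOnlyList<ExtractedPage> Pages,
    IReadOnlyList<OutlineEntry> Outline,
    string? Error = null)
{
    public static ExtractedDocument Failure(ExtractionStatus status, string error) =>
        new(status, Array.Empty<ExtractedPage>(), Array.Empty<OutlineEntry>(), error);
}

public interface IDocumentTextExtractor
{
    // Return type is an assumption; the PR only names the parameters.
    Task<ExtractedDocument> ExtractAsync(Stream pdf, CancellationToken cancellationToken);
}
```

With a shape like this, a caller such as the W3-2 function could branch on `Status` and skip a document without wrapping the call in its own try/catch.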

Test Plan

dotnet build PinballWizard.slnx — 0 warnings, 0 errors
dotnet test PinballWizard.slnx — 732 / 732 passing (was 724; +8 new PdfPigDocumentTextExtractorTests)
Sibling-diff against Integrations/{Foundry,AiSearch}/ smoke probes — see Local review section below
Real-PDF fixture validation — deferred to W2-2 (chunker tests exercise real Stern bulletins / JJP manuals end-to-end through the extractor)

Out of Scope

OCR fallback for OcrRequired PDFs — per Phase 4 § Deferred features, Phase 4.5 makes the OCR-vs-defer decision (Azure Document Intelligence vs. accepting a coverage gap).
Real-PDF fixture suite — local review flagged this as a minor item; deferred to W2-2 where the chunker integration tests will exercise real Stern bulletins, JJP manuals, and CGC remake docs end-to-end.
PdfExtractionOptions for the OcrRequiredCharFloor — local review flagged this hardcoded magic number; deferred to Phase 4.x as a hoist-to-options follow-up. v1 hardcoded value (32 chars) is conservative for the curated subset.
Input-size guards (zip-bomb / multi-GB PDF protection) — local review flagged this as a Phase 4.5 corpus-expansion concern; v1 trusts the curated subset's bounded size.
DI wiring in CLI Program.cs — AddPdfDocumentTextExtractor is the public extension; the CLI doesn't yet consume it because the extractor's first caller is W2-2's chunker (not the CLI). Wire-up happens in the PR that adds the chunker.
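
For readers unfamiliar with the OcrRequired heuristic mentioned above, a minimal sketch follows. The floor of 32 characters is the v1 value quoted in this PR; how the extractor actually counts characters (whole document vs. per page, with or without whitespace) is not stated here, so the counting rule, the class name, and the method name below are all assumptions.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class OcrHeuristicSketch
{
    // v1 value quoted in the PR; hoisting it into options is deferred.
    private const int OcrRequiredCharFloor = 32;

    // Assumed rule: sum the non-whitespace characters over all pages and
    // treat the document as scanned/image-only when the total is below the floor.
    public static bool LooksLikeScan(IEnumerable<string> pageTexts)
    {
        var visibleChars = pageTexts.Sum(text => text.Count(c => !char.IsWhiteSpace(c)));
        return visibleChars < OcrRequiredCharFloor;
    }
}
```

Under this assumed rule a legitimate but very short document, such as a one-line service bulletin, would also fall below the floor, which is the kind of edge case that motivates making the value configurable.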

Checklist

CI is green (build + test + coverage + CodeQL + sanitization) — pre-push verified locally
PR title follows the Conventional Commits format
If this is a new architectural decision, an ADR has been added — N/A (W1-5 implements ADR-0019's prerequisite extraction step; no new architectural decision)
If user-visible behavior changes, README.md and/or docs/ are updated — N/A (Application abstraction; no end-user-visible surface; W2-2 + W3-2 are the consuming PRs that will surface user-visible behavior)
If a memory in ~/.claude/projects/c--projects-PinballWizard/memory/ is now stale, it has been updated or removed — N/A (the wave 0 close + cleanup handoffs reference W1-5 as the next item; they remain accurate)
No TODO / FIXME / commented-out code committed
No new entries in <NoWarn> without a comment — N/A

Pre-push self-audit

Step 0 — `/local-review` (qualitative)

Ran /local-review and addressed every 🔴 finding before push
Local review outcome: 1 🔴 (fixed pre-push) / 4 ⚠️ (1 fixed inline, 3 deferred-with-justification) / 5 categories ✅
- 🔴 (fixed): The using (document) block was originally outside the try/catch — exceptions thrown during page enumeration or outline extraction (after PdfDocument.Open succeeds) would have bypassed the IDocumentTextExtractor structured-result-on-failure contract. Fixed by widening the try/catch to wrap the entire using (document) body. Now any PdfPig exception mid-stream is caught and classified as Malformed per the contract.
- ⚠️ #1 (fixed inline): Logger-severity drift from sibling probes. PdfPigDocumentTextExtractor logs at LogWarning for Encrypted + Malformed; AzureFoundrySmokeProbe / AzureAiSearchSmokeProbe log at LogError. Kept divergent (encrypted/malformed PDFs are expected during noisy-corpus ingestion; smoke probe failures are operational). Rationale documented inline at PdfPigDocumentTextExtractor.cs:54-62 so a future contributor doesn't "fix" the divergence away.
- ⚠️ #2 (deferred): OcrRequiredCharFloor=32 as magic number. Acceptable for Phase 4's curated subset (~10 PDFs, all known-good modern manuals). Hoist-to-options is a Phase 4.x follow-up when corpus expansion exposes edge cases. Comment-noted inline at PdfPigDocumentTextExtractor.cs:25-30.
- ⚠️ #3 (deferred): Unbounded input size. Phase 4 curated subset is bounded; v1 trusts the input. Add a MaxStreamBytes guard before Phase 4.5 corpus expansion.
- ⚠️ #4 (deferred): Real-PDF fixture suite. Programmatic fixtures via PdfPig's own writer test PdfPig against itself, which is acceptable for structural assertions but doesn't catch Adobe / InDesign / 1990s-typesetter quirks. W2-2 chunker integration tests will exercise real PDFs end-to-end.
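
The 🔴 fix in the list above (moving the `using` inside the try so mid-stream failures still become a status value) can be sketched without PdfPig. `FakeDocument` and `ExtractionSketch` are hypothetical stand-ins invented for illustration; only the exception filter and the Malformed classification mirror what this PR describes.

```csharp
using System;
using System.Collections.Generic;

public enum ExtractionStatus { Success, OcrRequired, Encrypted, Malformed }

// Hypothetical stand-in for a PDF document whose page reads may throw.
public sealed class FakeDocument : IDisposable
{
    private readonly Func<IEnumerable<string>> _readPages;
    public bool Disposed { get; private set; }
    public FakeDocument(Func<IEnumerable<string>> readPages) => _readPages = readPages;
    public IEnumerable<string> GetPageTexts() => _readPages();
    public void Dispose() => Disposed = true;
}

public static class ExtractionSketch
{
    // open() stands in for PdfDocument.Open. Both the open call and every
    // read from the document sit inside the same try block.
    public static (ExtractionStatus Status, List<string> Pages) Extract(Func<FakeDocument> open)
    {
        try
        {
            using var document = open();
            var pages = new List<string>();
            foreach (var text in document.GetPageTexts())
            {
                pages.Add(text);
            }
            return (ExtractionStatus.Success, pages);
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Any non-cancellation failure is reported as a status value;
            // cancellation is left to propagate to the caller.
            return (ExtractionStatus.Malformed, new List<string>());
        }
    }
}
```

Because the `using` declaration lives inside the try, the document is disposed before the catch runs, and an exception raised while enumerating pages is handled the same way as one raised while opening.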

Step 1 — Mechanical checklist

Every new *Options property has at least one real getter call in src/ — N/A (no *Options added; the extractor is configuration-free; OcrRequiredCharFloor is private const)
Sibling-diffed against AzureFoundrySmokeProbe + AzureAiSearchSmokeProbe (see local-review category 4): identical ctor null-checks (ArgumentNullException.ThrowIfNull), identical TryAddSingleton DI registration shape, identical structured-result-on-failure contract pattern. Drift: logger severity (justified inline; not a bug); structured-result shape couples Status enum + Error string vs. the probes' flat Success+payload (justified — extraction has 4 distinct failure modes vs. binary smoke probe).
No bare catch { } — all catches scoped to specific exception types or catch (Exception ex) when (ex is not OperationCanceledException)
New ISourceScraper? — N/A
Tests assert behavior, not just structure — page-number sequencing test asserts Pages[0].PageNumber == 1, Pages[1].PageNumber == 2, Pages[2].PageNumber == 3 against a 3-page PDF; would fail if a buggy implementation re-numbered pages from 0. Multi-page text content test asserts each page's expected text appears in the corresponding page's .Text.
Build is zero-warning — verified 0 Warning(s), 0 Error(s)
git log -1 --format='%an <%ae>' shows personal noreply, not work email — confirmed Jim Keeley <94459922+jkeeley2073@users.noreply.github.com>

Phase 4 W1-5 per docs/build-spec.md § Phase 4 scope item 14. Pure unit-testable; no Cosmos / AI Search deps. Sequenced before W2-2 (hybrid chunker, ADR-0019) which consumes IDocumentTextExtractor for per-page text + outline.

What this PR ships:

- New `IDocumentTextExtractor` abstraction in `src/PinballWizard.Application/Rag/Extraction/`. ExtractAsync takes a Stream + CancellationToken, returns ExtractedDocument with per-page Text, OutlineEntry list, Pages list, and an ExtractionStatus enum (Success / OcrRequired / Encrypted / Malformed). Failure modes surface as Status values rather than exceptions so the W3-2 Cosmos Change Feed Function can log+skip without try/catch.
- ExtractedDocument record (with ExtractedPage + OutlineEntry nested records) and ExtractionStatus enum. Per-page PageNumber preserved from PdfPig's 1-based page.Number for ADR-0019's page-anchor citations + ADR-0021's `page_start`/`page_end` index fields.
- New `PdfPigDocumentTextExtractor` in `src/PinballWizard.Infrastructure/Rag/Extraction/`. Wraps UglyToad.PdfPig 0.1.14 (the PdfPig NuGet package; UglyToad.PdfPig is the namespace). Single try/catch wraps PdfDocument.Open AND every operation that touches the document: page enumeration, page.Text access, TryGetBookmarks. PdfPig is known to throw mid-stream on malformed-but-openable PDFs (truncated content streams, invalid font references, broken xref tables only surfacing at content time); the wider catch ensures the IDocumentTextExtractor structured-result-on-failure contract holds.
- AddPdfDocumentTextExtractor DI extension (Singleton; extractor is stateless + thread-safe).
- New PdfPig 0.1.14 NuGet package added to Infrastructure csproj. Pure-managed (no native deps); compatible with the showcase posture (no GPL/AGPL like iText 7 community license).
- 8 unit tests with programmatic fixture PDFs generated via PdfPig's own writer (UglyToad.PdfPig.Writer.PdfDocumentBuilder), which keeps the test suite self-contained without committing binary blobs.

Coverage: ctor null-arg guard, null stream, malformed bytes, empty stream, success-with-text, multi-page page-number preservation, no-text PDF → OcrRequired (heuristic floor 32 chars), cancellation propagation.

OcrRequiredCharFloor=32 is hardcoded for Phase 4's curated subset (~10 PDFs, all known-good modern manuals); revisit-as-options when the Phase 4.5 corpus expansion exposes edge cases (e.g., one-line service bulletins that are legitimate but short). Comment-noted inline at PdfPigDocumentTextExtractor.cs:25-30.

Logger severity is LogWarning (not LogError) for Encrypted + Malformed branches because both are expected outcomes during ingestion of a noisy real-world corpus. This is a distinct posture from the Foundry / AI Search smoke probes, which log at LogError because their failure surface IS operational. Comment-noted inline.

Local review: 1 🔴 (fixed pre-push: widened the try/catch to wrap the entire `using (document)` body; the original structure had `using` outside the try, so exceptions thrown during page enumeration or outline extraction would bypass the structured-result-on-failure contract). 4 ⚠️ addressed inline:

- OcrRequiredCharFloor as magic number: hoist-as-options deferred to Phase 4.x with comment-noted rationale
- Unbounded input size: defer to Phase 4.5 corpus expansion (curated subset is bounded; documented as Phase 4.5 follow-up)
- Real-PDF fixture missing: defer to W2-2 (which exercises real Stern bulletins and JJP manuals end-to-end via the chunker)
- Logger-severity drift from sibling probes: kept divergent; rationale documented inline (corpus-noise vs. operational failure modes)

Build: 0 warnings, 0 errors. Tests: 732 / 732 (was 724; +8 new PdfPig extractor tests).

+    // back through the extractor.
+    private static byte[] BuildPdfWithText(string text)
+    {
+        var builder = new PdfDocumentBuilder();


+
+    private static byte[] BuildPdfWithPages(params string[] pageTexts)
+    {
+        var builder = new PdfDocumentBuilder();


+        // This is the synthetic equivalent of a scanned-image-only PDF
+        // where PdfPig parses successfully but yields no extractable
+        // text — the OcrRequired heuristic branch.
+        var builder = new PdfDocumentBuilder();


jkeeley2073 added the claude-code label ("Generated with Claude Code") May 8, 2026

jkeeley2073 merged commit 021d503 into main May 8, 2026
4 checks passed

github-advanced-security AI found potential problems May 8, 2026

jkeeley2073 mentioned this pull request May 8, 2026

fix(rag) PdfPig input-size guard + hoist OcrRequiredCharFloor to options #108

Merged

3 tasks

jkeeley2073 merged 1 commit into main from Dev-Phase4W15PdfPigExtractor
