feat(rag) Phase 4 W1-5 — PdfPig wrapper + text-extraction service#104
Merged
Conversation
Phase 4 W1-5 per docs/build-spec.md § Phase 4 scope item 14. Pure unit-testable; no Cosmos / AI Search deps. Sequenced before W2-2 (hybrid chunker, ADR-0019) which consumes IDocumentTextExtractor for per-page text + outline. What this PR ships: - New `IDocumentTextExtractor` abstraction in `src/PinballWizard.Application/Rag/Extraction/`. ExtractAsync takes a Stream + CancellationToken, returns ExtractedDocument with per-page Text, OutlineEntry list, Pages list, and an ExtractionStatus enum (Success / OcrRequired / Encrypted / Malformed). Failure modes surface as Status values rather than exceptions so the W3-2 Cosmos Change Feed Function can log+skip without try/catch. - ExtractedDocument record (with ExtractedPage + OutlineEntry nested records) and ExtractionStatus enum. Per-page PageNumber preserved from PdfPig's 1-based page.Number for ADR-0019's page-anchor citations + ADR-0021's `page_start`/`page_end` index fields. - New `PdfPigDocumentTextExtractor` in `src/PinballWizard.Infrastructure/Rag/Extraction/`. Wraps UglyToad.PdfPig 0.1.14 (the PdfPig NuGet package; UglyToad.PdfPig is the namespace). Single try/catch wraps PdfDocument.Open AND every operation that touches the document — page enumeration, page.Text access, TryGetBookmarks. PdfPig is known to throw mid-stream on malformed-but-openable PDFs (truncated content streams, invalid font references, broken xref tables only surfacing at content time); the wider catch ensures the IDocumentTextExtractor structured-result-on-failure contract holds. - AddPdfDocumentTextExtractor DI extension (Singleton; extractor is stateless + thread-safe). - New PdfPig 0.1.14 NuGet package added to Infrastructure csproj. Pure-managed (no native deps); compatible with the showcase posture (no GPL/AGPL like iText 7 community license). - 8 unit tests with programmatic fixture PDFs generated via PdfPig's own writer (UglyToad.PdfPig.Writer.PdfDocumentBuilder) — keeps the test suite self-contained without committing binary blobs. Coverage: ctor null-arg guard, null stream, malformed bytes, empty stream, success-with-text, multi-page page-number preservation, no-text PDF → OcrRequired (heuristic floor 32 chars), cancellation propagation. OcrRequiredCharFloor=32 is hardcoded for Phase 4's curated subset (~10 PDFs, all known-good modern manuals); revisit-as-options when the Phase 4.5 corpus expansion exposes edge cases (e.g., one-line service bulletins that are legitimate but short). Comment-noted inline at PdfPigDocumentTextExtractor.cs:25-30. Logger severity is LogWarning (not LogError) for Encrypted + Malformed branches because both are expected outcomes during ingestion of a noisy real-world corpus — distinct posture from the Foundry / AI Search smoke probes which log at LogError because their failure surface IS operational. Comment-noted inline. Local review: 1 🔴 (fixed pre-push: widened the try/catch to wrap the entire `using (document)` body; the original structure had `using` outside the try, so exceptions thrown during page enumeration or outline extraction would bypass the structured- result-on-failure contract). 4⚠️ addressed inline: - OcrRequiredCharFloor as magic number — hoist-as-options deferred to Phase 4.x with comment-noted rationale - Unbounded input size — defer to Phase 4.5 corpus expansion (curated subset is bounded; documented as Phase 4.5 follow-up) - Real-PDF fixture missing — defer to W2-2 (which exercises real Stern bulletins and JJP manuals end-to-end via the chunker) - Logger-severity drift from sibling probes — kept divergent; rationale documented inline (corpus-noise vs. operational failure modes) Build: 0 warnings, 0 errors. Tests: 732 / 732 (was 724; +8 new PdfPig extractor tests).
| // back through the extractor. | ||
| private static byte[] BuildPdfWithText(string text) | ||
| { | ||
| var builder = new PdfDocumentBuilder(); |
|
|
||
| private static byte[] BuildPdfWithPages(params string[] pageTexts) | ||
| { | ||
| var builder = new PdfDocumentBuilder(); |
| // This is the synthetic equivalent of a scanned-image-only PDF | ||
| // where PdfPig parses successfully but yields no extractable | ||
| // text — the OcrRequired heuristic branch. | ||
| var builder = new PdfDocumentBuilder(); |
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 4 W1-5 per
docs/build-spec.md§ Phase 4 scope item 14. Pure unit-testable; no Cosmos / AI Search deps. Sequenced before W2-2 (hybrid chunker per ADR-0019) which consumesIDocumentTextExtractorfor per-page text + outline.This is the second-to-last Wave 1 PR (only the W1-4 H1 operational hand-off remains, which is operator-side, not a PR).
What ships
IDocumentTextExtractorabstraction inApplication/Rag/Extraction/.ExtractAsync(Stream, CancellationToken)returnsExtractedDocumentwith per-page text, an outline list, and anExtractionStatusenum (Success/OcrRequired/Encrypted/Malformed). Failure modes surface as Status values rather than exceptions — the W3-2 Cosmos Change Feed Function can log+skip without try/catch.ExtractedDocumentrecord withExtractedPage(PageNumber + Text) andOutlineEntry(Title + PageNumber + Level) nested records. Per-page PageNumber preserved from PdfPig's 1-basedpage.Numberfor ADR-0019's page-anchor citations + ADR-0021'spage_start/page_endindex fields.PdfPigDocumentTextExtractorinInfrastructure/Rag/Extraction/. WrapsUglyToad.PdfPig0.1.14 (NuGet packagePdfPig). Single try/catch wrapsPdfDocument.OpenAND the entireusing (document)body — page enumeration, page.Text access, TryGetBookmarks. PdfPig is known to throw mid-stream on malformed-but-openable PDFs; the wider catch keeps the structured-result-on-failure contract intact.AddPdfDocumentTextExtractorDI extension (Singleton; extractor is stateless + thread-safe).PdfPig0.1.14 NuGet package added to Infrastructure csproj. Pure-managed (no native deps); compatible with the showcase posture (no GPL/AGPL like iText 7 community).Test Plan
dotnet build PinballWizard.slnx— 0 warnings, 0 errorsdotnet test PinballWizard.slnx— 732 / 732 passing (was 724; +8 newPdfPigDocumentTextExtractorTests)Integrations/{Foundry,AiSearch}/smoke probes — see Local review section belowOut of Scope
OcrRequiredPDFs — per Phase 4 § Deferred features, Phase 4.5 makes the OCR-vs-defer decision (Azure Document Intelligence vs. accepting a coverage gap).PdfExtractionOptionsfor theOcrRequiredCharFloor— local review flagged this hardcoded magic number; deferred to Phase 4.x as a hoist-to-options follow-up. v1 hardcoded value (32 chars) is conservative for the curated subset.AddPdfDocumentTextExtractoris the public extension; the CLI doesn't yet consume it because the extractor's first caller is W2-2's chunker (not the CLI). Wire-up happens in the PR that adds the chunker.Checklist
~/.claude/projects/c--projects-PinballWizard/memory/is now stale, it has been updated or removed — N/A (the wave 0 close + cleanup handoffs reference W1-5 as the next item; they remain accurate)TODO/FIXME/ commented-out code committed<NoWarn>without a comment — N/APre-push self-audit
Step 0 —
/local-review(qualitative)/local-reviewand addressed every 🔴 finding before pushusing (document)block was originally outside the try/catch — exceptions thrown during page enumeration or outline extraction (afterPdfDocument.Opensucceeds) would have bypassed theIDocumentTextExtractorstructured-result-on-failure contract. Fixed by widening the try/catch to wrap the entireusing (document)body. Now any PdfPig exception mid-stream is caught and classified asMalformedper the contract.LogWarningfor Encrypted + Malformed; AzureFoundrySmokeProbe / AzureAiSearchSmokeProbe log atLogError. Kept divergent (encrypted/malformed PDFs are expected during noisy-corpus ingestion; smoke probe failures are operational). Rationale documented inline atPdfPigDocumentTextExtractor.cs:54-62so a future contributor doesn't "fix" the divergence away.OcrRequiredCharFloor=32as magic number. Acceptable for Phase 4's curated subset (~10 PDFs, all known-good modern manuals). Hoist-to-options is a Phase 4.x follow-up when corpus expansion exposes edge cases. Comment-noted inline atPdfPigDocumentTextExtractor.cs:25-30.MaxStreamBytesguard before Phase 4.5 corpus expansion.Step 1 — Mechanical checklist
*Optionsproperty has at least one real getter call insrc/— N/A (no*Optionsadded; the extractor is configuration-free; OcrRequiredCharFloor isprivate const)AzureFoundrySmokeProbe+AzureAiSearchSmokeProbe(see local-review category 4): identical ctor null-checks (ArgumentNullException.ThrowIfNull), identicalTryAddSingletonDI registration shape, identical structured-result-on-failure contract pattern. Drift: logger severity (justified inline; not a bug); structured-result shape couples Status enum + Error string vs. the probes' flat Success+payload (justified — extraction has 4 distinct failure modes vs. binary smoke probe).catch { }— all catches scoped to specific exception types orcatch (Exception ex) when (ex is not OperationCanceledException)ISourceScraper? — N/APages[0].PageNumber == 1, Pages[1].PageNumber == 2, Pages[2].PageNumber == 3against a 3-page PDF; would fail if a buggy implementation re-numbered pages from 0. Multi-page text content test asserts each page's expected text appears in the corresponding page's.Text.0 Warning(s), 0 Error(s)git log -1 --format='%an <%ae>'shows personal noreply, not work email — confirmedJim Keeley <94459922+jkeeley2073@users.noreply.github.com>