
feat(rag) Phase 4 W1-5 — PdfPig wrapper + text-extraction service #104

Merged: jkeeley2073 merged 1 commit into main from Dev-Phase4W15PdfPigExtractor on May 8, 2026

@jkeeley2073 (Contributor)

Summary

Phase 4 W1-5 per docs/build-spec.md § Phase 4 scope item 14. Pure unit-testable; no Cosmos / AI Search deps. Sequenced before W2-2 (hybrid chunker per ADR-0019) which consumes IDocumentTextExtractor for per-page text + outline.

This is the second-to-last Wave 1 PR (only the W1-4 H1 operational hand-off remains, which is operator-side, not a PR).

What ships

  • IDocumentTextExtractor abstraction in Application/Rag/Extraction/. ExtractAsync(Stream, CancellationToken) returns ExtractedDocument with per-page text, an outline list, and an ExtractionStatus enum (Success / OcrRequired / Encrypted / Malformed). Failure modes surface as Status values rather than exceptions — the W3-2 Cosmos Change Feed Function can log+skip without try/catch.
  • ExtractedDocument record with ExtractedPage (PageNumber + Text) and OutlineEntry (Title + PageNumber + Level) nested records. Per-page PageNumber preserved from PdfPig's 1-based page.Number for ADR-0019's page-anchor citations + ADR-0021's page_start/page_end index fields.
  • PdfPigDocumentTextExtractor in Infrastructure/Rag/Extraction/. Wraps UglyToad.PdfPig 0.1.14 (NuGet package PdfPig). Single try/catch wraps PdfDocument.Open AND the entire using (document) body — page enumeration, page.Text access, TryGetBookmarks. PdfPig is known to throw mid-stream on malformed-but-openable PDFs; the wider catch keeps the structured-result-on-failure contract intact.
  • AddPdfDocumentTextExtractor DI extension (Singleton; extractor is stateless + thread-safe).
  • PdfPig 0.1.14 NuGet package added to Infrastructure csproj. Pure-managed (no native deps) and compatible with the showcase posture: PdfPig is Apache-2.0 licensed, so it avoids the GPL/AGPL obligations that come with the iText 7 Community edition.
  • 8 unit tests with programmatic fixture PDFs generated via PdfPig's own writer — no committed binary blobs. Coverage: ctor null-arg guard, null stream, malformed bytes, empty stream, success-with-text, multi-page PageNumber preservation, no-text PDF → OcrRequired heuristic, cancellation propagation.
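A minimal sketch of the contract described in the list above, for readers who want to see the shape in code. The type names, the four status values, and the PageNumber/Text/Title/Level members come from this description; the `Outline` and `Error` property names, the `Failed` helper, and the `ExtractionCaller` class are assumptions for illustration, and the records are shown top-level rather than nested for brevity.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public enum ExtractionStatus { Success, OcrRequired, Encrypted, Malformed }

public sealed record ExtractedPage(int PageNumber, string Text);

public sealed record OutlineEntry(string Title, int PageNumber, int Level);

// Assumption: the real record may name or order these members differently.
public sealed record ExtractedDocument(
    ExtractionStatus Status,
    IReadOnlyList<ExtractedPage> Pages,
    IReadOnlyList<OutlineEntry> Outline,
    string? Error = null)
{
    // Hypothetical helper: an empty result carrying a non-success status.
    public static ExtractedDocument Failed(ExtractionStatus status, string error) =>
        new(status, Array.Empty<ExtractedPage>(), Array.Empty<OutlineEntry>(), error);
}

public interface IDocumentTextExtractor
{
    Task<ExtractedDocument> ExtractAsync(Stream pdf, CancellationToken cancellationToken);
}

// Hypothetical consumer: branches on Status instead of catching exceptions.
public static class ExtractionCaller
{
    public static string Describe(ExtractedDocument doc) => doc.Status switch
    {
        ExtractionStatus.Success => $"indexed {doc.Pages.Count} page(s)",
        ExtractionStatus.OcrRequired => "skipped: no text layer",
        ExtractionStatus.Encrypted => "skipped: encrypted",
        ExtractionStatus.Malformed => $"skipped: malformed ({doc.Error})",
        _ => "skipped: unknown status",
    };
}
```

Because every failure mode is a value, a log-and-skip consumer such as the planned W3-2 function only needs a `switch` over `Status`.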

Test Plan

  • dotnet build PinballWizard.slnx: 0 warnings, 0 errors
  • dotnet test PinballWizard.slnx: 732 / 732 passing (was 724; +8 new PdfPigDocumentTextExtractorTests)
  • Sibling-diff against Integrations/{Foundry,AiSearch}/ smoke probes — see Local review section below
  • Real-PDF fixture validation — deferred to W2-2 (chunker tests exercise real Stern bulletins / JJP manuals end-to-end through the extractor)

Out of Scope

  • OCR fallback for OcrRequired PDFs — per Phase 4 § Deferred features, Phase 4.5 makes the OCR-vs-defer decision (Azure Document Intelligence vs. accepting a coverage gap).
  • Real-PDF fixture suite — local review flagged this as a minor item; deferred to W2-2 where the chunker integration tests will exercise real Stern bulletins, JJP manuals, and CGC remake docs end-to-end.
  • PdfExtractionOptions for the OcrRequiredCharFloor — local review flagged this hardcoded magic number; deferred to Phase 4.x as a hoist-to-options follow-up. v1 hardcoded value (32 chars) is conservative for the curated subset.
  • Input-size guards (zip-bomb / multi-GB PDF protection) — local review flagged this as a Phase 4.5 corpus-expansion concern; v1 trusts the curated subset's bounded size.
  • DI wiring in CLI Program.cs: AddPdfDocumentTextExtractor is the public extension; the CLI doesn't yet consume it because the extractor's first caller is W2-2's chunker (not the CLI). Wire-up happens in the PR that adds the chunker.

Checklist

  • CI is green (build + test + coverage + CodeQL + sanitization) — pre-push verified locally
  • PR title follows the Conventional Commits format
  • If this is a new architectural decision, an ADR has been added — N/A (W1-5 implements ADR-0019's prerequisite extraction step; no new architectural decision)
  • If user-visible behavior changes, README.md and/or docs/ are updated — N/A (Application abstraction; no end-user-visible surface; W2-2 + W3-2 are the consuming PRs that will surface user-visible behavior)
  • If a memory in ~/.claude/projects/c--projects-PinballWizard/memory/ is now stale, it has been updated or removed — N/A (the wave 0 close + cleanup handoffs reference W1-5 as the next item; they remain accurate)
  • No TODO / FIXME / commented-out code committed
  • No new entries in <NoWarn> without a comment — N/A

Pre-push self-audit

Step 0 — /local-review (qualitative)

  • Ran /local-review and addressed every 🔴 finding before push
  • Local review outcome: 1 🔴 (fixed pre-push) / 4 ⚠️ (1 fixed inline, 3 deferred-with-justification) / 5 categories ✅
    • 🔴 (fixed): The using (document) block was originally outside the try/catch — exceptions thrown during page enumeration or outline extraction (after PdfDocument.Open succeeds) would have bypassed the IDocumentTextExtractor structured-result-on-failure contract. Fixed by widening the try/catch to wrap the entire using (document) body. Now any PdfPig exception mid-stream is caught and classified as Malformed per the contract.
    • ⚠️ #1 (fixed inline): Logger-severity drift from sibling probes. PdfPigDocumentTextExtractor logs at LogWarning for Encrypted + Malformed; AzureFoundrySmokeProbe / AzureAiSearchSmokeProbe log at LogError. Kept divergent (encrypted/malformed PDFs are expected during noisy-corpus ingestion; smoke probe failures are operational). Rationale documented inline at PdfPigDocumentTextExtractor.cs:54-62 so a future contributor doesn't "fix" the divergence away.
    • ⚠️ #2 (deferred): OcrRequiredCharFloor=32 as magic number. Acceptable for Phase 4's curated subset (~10 PDFs, all known-good modern manuals). Hoist-to-options is a Phase 4.x follow-up when corpus expansion exposes edge cases. Comment-noted inline at PdfPigDocumentTextExtractor.cs:25-30.
    • ⚠️ #3 (deferred): Unbounded input size. Phase 4 curated subset is bounded; v1 trusts the input. Add a MaxStreamBytes guard before Phase 4.5 corpus expansion.
    • ⚠️ #4 (deferred): Real-PDF fixture suite. Programmatic fixtures via PdfPig's own writer test PdfPig against itself, which is acceptable for structural assertions but doesn't catch Adobe / InDesign / 1990s-typesetter quirks. W2-2 chunker integration tests will exercise real PDFs end-to-end.
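The 🔴 fix above can be illustrated with a small sketch. It is not the code from this PR: `WideCatchSketch`, `PageText`, and `ReadResult` are hypothetical names, outline extraction is omitted, and it assumes the PdfPig NuGet package is referenced. The point is only the placement of the `try`: `PdfDocument.Open` and the whole `using` body sit inside it, so an exception thrown while enumerating pages is turned into a result value.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

public sealed record PageText(int PageNumber, string Text);

// Hypothetical result shape: Error is null when reading succeeded.
public sealed record ReadResult(IReadOnlyList<PageText> Pages, string? Error);

public static class WideCatchSketch
{
    public static ReadResult ReadPages(Stream pdf)
    {
        var pages = new List<PageText>();
        try
        {
            // Open AND every later touch of the document are inside the try.
            using (PdfDocument document = PdfDocument.Open(pdf))
            {
                foreach (Page page in document.GetPages())
                {
                    // page.Number is 1-based; page.Text can throw on broken content.
                    pages.Add(new PageText(page.Number, page.Text));
                }
            }
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Cancellation still propagates; everything else becomes a value.
            return new ReadResult(Array.Empty<PageText>(), ex.Message);
        }

        return new ReadResult(pages, null);
    }
}
```

With the `using` outside the `try`, only failures in `Open` would have been caught. The exception filter keeps `OperationCanceledException` out of the catch, matching the catch pattern listed in the mechanical checklist.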

Step 1 — Mechanical checklist

  • Every new *Options property has at least one real getter call in src/ — N/A (no *Options added; the extractor is configuration-free; OcrRequiredCharFloor is private const)
  • Sibling-diffed against AzureFoundrySmokeProbe + AzureAiSearchSmokeProbe (see local-review category 4): identical ctor null-checks (ArgumentNullException.ThrowIfNull), identical TryAddSingleton DI registration shape, identical structured-result-on-failure contract pattern. Drift: logger severity (justified inline; not a bug); structured-result shape couples Status enum + Error string vs. the probes' flat Success+payload (justified — extraction has 4 distinct failure modes vs. binary smoke probe).
  • No bare catch { } — all catches scoped to specific exception types or catch (Exception ex) when (ex is not OperationCanceledException)
  • New ISourceScraper? — N/A
  • Tests assert behavior, not just structure — page-number sequencing test asserts Pages[0].PageNumber == 1, Pages[1].PageNumber == 2, Pages[2].PageNumber == 3 against a 3-page PDF; would fail if a buggy implementation re-numbered pages from 0. Multi-page text content test asserts each page's expected text appears in the corresponding page's .Text.
  • Build is zero-warning — verified 0 Warning(s), 0 Error(s)
  • git log -1 --format='%an <%ae>' shows personal noreply, not work email — confirmed Jim Keeley <94459922+jkeeley2073@users.noreply.github.com>
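For reference, the `TryAddSingleton` registration shape mentioned in the sibling-diff item might look like the sketch below. The extension method name comes from this PR; the containing class name and the two stub types are placeholders so the snippet compiles on its own (the real extractor takes constructor arguments that are not shown here). It assumes the Microsoft.Extensions.DependencyInjection package.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.DependencyInjection.Extensions;

// Stubs standing in for the real abstraction and implementation.
public interface IDocumentTextExtractor { }
public sealed class PdfPigDocumentTextExtractor : IDocumentTextExtractor { }

// Hypothetical class name; only the method name is taken from the PR.
public static class PdfExtractionServiceCollectionExtensions
{
    public static IServiceCollection AddPdfDocumentTextExtractor(this IServiceCollection services)
    {
        ArgumentNullException.ThrowIfNull(services);

        // TryAdd* leaves any earlier registration in place, so calling this
        // twice (or after a test double was registered) does not add a duplicate.
        services.TryAddSingleton<IDocumentTextExtractor, PdfPigDocumentTextExtractor>();
        return services;
    }
}
```

A singleton lifetime is reasonable here only because the extractor is described as stateless and thread-safe.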

Phase 4 W1-5 per docs/build-spec.md § Phase 4 scope item 14. Pure
unit-testable; no Cosmos / AI Search deps. Sequenced before W2-2
(hybrid chunker, ADR-0019) which consumes IDocumentTextExtractor for
per-page text + outline.

What this PR ships:

- New `IDocumentTextExtractor` abstraction in
  `src/PinballWizard.Application/Rag/Extraction/`. ExtractAsync takes
  a Stream + CancellationToken, returns ExtractedDocument with
  per-page Text, OutlineEntry list, Pages list, and an
  ExtractionStatus enum (Success / OcrRequired / Encrypted /
  Malformed). Failure modes surface as Status values rather than
  exceptions so the W3-2 Cosmos Change Feed Function can log+skip
  without try/catch.

- ExtractedDocument record (with ExtractedPage + OutlineEntry nested
  records) and ExtractionStatus enum. Per-page PageNumber preserved
  from PdfPig's 1-based page.Number for ADR-0019's page-anchor
  citations + ADR-0021's `page_start`/`page_end` index fields.

- New `PdfPigDocumentTextExtractor` in
  `src/PinballWizard.Infrastructure/Rag/Extraction/`. Wraps
  UglyToad.PdfPig 0.1.14 (the PdfPig NuGet package; UglyToad.PdfPig
  is the namespace). Single try/catch wraps PdfDocument.Open AND
  every operation that touches the document — page enumeration,
  page.Text access, TryGetBookmarks. PdfPig is known to throw
  mid-stream on malformed-but-openable PDFs (truncated content
  streams, invalid font references, broken xref tables only
  surfacing at content time); the wider catch ensures the
  IDocumentTextExtractor structured-result-on-failure contract holds.

- AddPdfDocumentTextExtractor DI extension (Singleton; extractor is
  stateless + thread-safe).

- New PdfPig 0.1.14 NuGet package added to Infrastructure csproj.
  Pure-managed (no native deps); compatible with the showcase posture
  (no GPL/AGPL obligations, unlike the iText 7 Community license).

- 8 unit tests with programmatic fixture PDFs generated via PdfPig's
  own writer (UglyToad.PdfPig.Writer.PdfDocumentBuilder) — keeps the
  test suite self-contained without committing binary blobs. Coverage:
  ctor null-arg guard, null stream, malformed bytes, empty stream,
  success-with-text, multi-page page-number preservation, no-text PDF
  → OcrRequired (heuristic floor 32 chars), cancellation propagation.

OcrRequiredCharFloor=32 is hardcoded for Phase 4's curated subset
(~10 PDFs, all known-good modern manuals); revisit-as-options when
the Phase 4.5 corpus expansion exposes edge cases (e.g., one-line
service bulletins that are legitimate but short). Comment-noted
inline at PdfPigDocumentTextExtractor.cs:25-30.
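As a rough illustration of the heuristic, the sketch below treats a document as needing OCR when its pages yield fewer than 32 characters in total. Only the constant name and value come from this PR; counting non-whitespace characters, summing across pages, and the strict `<` comparison are assumptions, as is the `OcrHeuristicSketch` name.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class OcrHeuristicSketch
{
    private const int OcrRequiredCharFloor = 32;

    // Assumption: whitespace is ignored and the count is summed over all pages.
    public static bool LooksScanned(IEnumerable<string> pageTexts)
    {
        int visibleChars = pageTexts.Sum(text => text.Count(c => !char.IsWhiteSpace(c)));
        return visibleChars < OcrRequiredCharFloor;
    }
}
```

Under these assumptions a one-line bulletin such as "Replace fuse F101." (16 non-whitespace characters) would be flagged even though it has a real text layer, which is the edge case the options follow-up is meant to address.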

Logger severity is LogWarning (not LogError) for Encrypted +
Malformed branches because both are expected outcomes during
ingestion of a noisy real-world corpus — distinct posture from the
Foundry / AI Search smoke probes which log at LogError because their
failure surface IS operational. Comment-noted inline.

Local review: 1 🔴 (fixed pre-push: widened the try/catch to wrap
the entire `using (document)` body; the original structure had
`using` outside the try, so exceptions thrown during page
enumeration or outline extraction would bypass the structured-
result-on-failure contract). 4 ⚠️ (1 resolved inline, 3 deferred
with justification):
  - OcrRequiredCharFloor as magic number — hoist-as-options deferred
    to Phase 4.x with comment-noted rationale
  - Unbounded input size — defer to Phase 4.5 corpus expansion
    (curated subset is bounded; documented as Phase 4.5 follow-up)
  - Real-PDF fixture missing — defer to W2-2 (which exercises real
    Stern bulletins and JJP manuals end-to-end via the chunker)
  - Logger-severity drift from sibling probes — kept divergent;
    rationale documented inline (corpus-noise vs. operational
    failure modes)

Build: 0 warnings, 0 errors. Tests: 732 / 732 (was 724; +8 new
PdfPig extractor tests).
@jkeeley2073 added the claude-code (Generated with Claude Code) label May 8, 2026
@jkeeley2073 jkeeley2073 merged commit 021d503 into main May 8, 2026
4 checks passed
// back through the extractor.
private static byte[] BuildPdfWithText(string text)
{
    var builder = new PdfDocumentBuilder();

private static byte[] BuildPdfWithPages(params string[] pageTexts)
{
    var builder = new PdfDocumentBuilder();

// This is the synthetic equivalent of a scanned-image-only PDF
// where PdfPig parses successfully but yields no extractable
// text — the OcrRequired heuristic branch.
var builder = new PdfDocumentBuilder();
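The excerpts above are truncated, so here is a self-contained sketch of how such programmatic fixtures could be built with PdfPig's writer and read back. The method bodies are not the ones from this PR: the font choice, text position, `BuildPdfWithoutText`, and `ReadPageNumbers` are assumptions, and the snippet requires the PdfPig NuGet package.

```csharp
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.Fonts.Standard14Fonts;
using UglyToad.PdfPig.Writer;

public static class FixtureSketch
{
    // One A4 page per string, each with a single line of Helvetica text.
    public static byte[] BuildPdfWithPages(params string[] pageTexts)
    {
        var builder = new PdfDocumentBuilder();
        PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);
        foreach (string text in pageTexts)
        {
            PdfPageBuilder page = builder.AddPage(PageSize.A4);
            page.AddText(text, 12, new PdfPoint(25, 700), font);
        }
        return builder.Build();
    }

    // Assumed stand-in for a scanned PDF: a page exists but carries no text.
    public static byte[] BuildPdfWithoutText()
    {
        var builder = new PdfDocumentBuilder();
        builder.AddPage(PageSize.A4);
        return builder.Build();
    }

    // Round-trips the bytes through the reader and returns page.Number per page.
    public static int[] ReadPageNumbers(byte[] pdf)
    {
        using PdfDocument document = PdfDocument.Open(pdf);
        return document.GetPages().Select(page => page.Number).ToArray();
    }
}
```

Since PdfPig page numbers are 1-based, `ReadPageNumbers(BuildPdfWithPages("a", "b", "c"))` should come back as 1, 2, 3, which is the property the page-number sequencing test checks.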