Skip to content

feat(cli) --seed-ingestion-sources reads JSON manifest, upserts to Cosmos#69

Merged
jkeeley2073 merged 1 commit into
mainfrom
Dev-SeedIngestionSources
May 4, 2026
Merged

feat(cli) --seed-ingestion-sources reads JSON manifest, upserts to Cosmos#69
jkeeley2073 merged 1 commit into
mainfrom
Dev-SeedIngestionSources

Conversation

@jkeeley2073
Copy link
Copy Markdown
Contributor

Summary

Closes docs/build-spec.md Phase 2 § Scope item 3 (Wave 1).

New CLI flag bootstraps the ingestion_sources Cosmos container from a checked-in JSON manifest, with idempotent read-merge-upsert semantics that preserve runtime fields populated by actual scraper runs (LastRunAt, LastSuccessAt, totalDocumentsDiscovered, totalRunFailures). Re-running the seeder with no manifest changes produces no semantic diff; manifest config-field changes apply without disturbing telemetry.

What ships

  • src/PinballWizard.Application/Sync/IIngestionSourceSeeder interface + IngestionSourceSeedResult record + IngestionSourceSeed DTO + IngestionSourceSeeder implementation. Application-layer service depending only on IIngestionSourceRepository and IngestionSource. The DTO intentionally omits runtime fields (defense-in-depth: even an accidentally-populated manifest can't blow them away on re-seed).
  • data/seeds/ingestion_sources.v1.json — 9 manufacturer entries (stern, jjp, ap, spooky, pinballbrothers, barrelsoffun, multimorphic, cgc, opdb) with cadences from project_phase2_architecture_decisions.md.
  • src/PinballWizard.Cli/Program.cs — wires --seed-ingestion-sources with the exit-code-2-when-Cosmos-not-configured pattern matching --ensure-cosmos-containers. Seeder DI registration is gated alongside AddCosmosPersistence so the service cannot resolve without its repository.
  • .gitignore — adjusts the data/ rule to data/* plus !data/seeds/ / !data/seeds/** so the seed manifest is tracked while runtime output (downloads / metadata / logs / snapshots / history) stays ignored. The comment block explains why the pattern is data/* not data/ (git cannot re-include children of an excluded parent directory).
  • tests/PinballWizard.Scraper.Tests/Sync/IngestionSourceSeederTests.cs — 8 tests covering happy path, the load-bearing idempotency assertion (re-run preserves runtime fields), duplicate-id rejection, missing-file, empty-array, malformed-JSON, cancellation, and a production-manifest sanity check that pins the on-disk JSON deserializes cleanly with the expected 9 ids.

Setup required after merge

  1. Local Aspire path: pwsh ./start-apphost.ps1 → set ConnectionStrings__cosmos from the dashboard → dotnet run --project src/PinballWizard.Cli -- --seed-ingestion-sources
  2. Deployed-Cosmos path: set Cosmos__AccountEndpoint + Cosmos__AccountResourceId (PowerShell, not Git-Bash) → run the same command
  3. Verify Cosmos ingestion_sources container has 9 documents matching the manifest (Cosmos Data Explorer or az cosmosdb sql container show)

Test Plan

  • dotnet test PinballWizard.slnx --nologo515 / 515 passing (was 507 pre-PR; +8 new seeder tests)
  • dotnet build PinballWizard.slnx --nologo → clean, zero warnings under TreatWarningsAsErrors
  • The ProductionManifest_DeserializesCleanlyAndContainsNineEntries test pins the on-disk JSON; manifest schema regressions fail at test time, not at runtime in production

Local review

/local-review ran against the diff:

  • 1 🔴 finding (fixed) — manifest's CGC entry used "chicagogaming" for both id and scraperImplKey, but the canonical key throughout the codebase is "cgc" (ScraperManufacturerKey.ChicagoGaming, OpdbMachineMapper normalization, ScraperOrchestrator.SourceAliases, the existing --source cgc CLI filter). Manifest + test fixture both updated to "cgc" before push.
  • 3 ⚠️ categories (2 sub-findings fixed, 3 deferred with justifications):
    • Fixed: added SeedAsync_PreCancelledToken_* test for cancellation propagation; added a terminal LogInformation summary in the seeder matching the sibling pattern in OpdbSyncService and ScraperReconciliationService.
    • Deferred — per-record try/catch on UpsertAsync failures: sibling OpdbSyncService has the same shape; the manifest is 9 rows and the seeder is idempotent, so a transient mid-loop failure is recovered by re-run. Revisit if the manifest grows substantially or Phase 6 operability work warrants partial-success counters.
    • Deferred — Total field naming nuance ("manifest count" vs "processed count"): the early-throw path doesn't return a result, so the distinction is academic; documented inline.
    • Deferred — ProductionManifest_DeserializesCleanlyAndContainsNineEntries doesn't follow SeedAsync_<state>_<expectation>: the test's subject is the on-disk file, not a method on the seeder; a Method_State_Expectation rename would obscure that.

7-item self-audit

  1. Every option field is read — the new IngestionSourceSeed DTO has 7 fields, all read by the seeder (verified by grep): Id, DisplayName, ScraperImplKey, BaseUrl, Enabled, Cadence, PolitenessOverrides. ✅
  2. Sibling-diff — diffed against OpdbSyncService and ScraperReconciliationService per local-review category Enterprise quality bar: build hardening, CI/CD gates, security tooling, integration tests #4; sibling drift was the source of the 🔴 (CGC key) and one ⚠️ (terminal log line). ✅
  3. No bare catch { } — only catch is catch (JsonException ex) at the deserialize call; rethrows as InvalidOperationException with context. Test cleanup catches IOException only. ✅
  4. CLI / orchestrator wiring--seed-ingestion-sources parsed and dispatched in Program.cs; IIngestionSourceSeeder registered in DI inside the cosmosWired block. Manual: dotnet run --project src/PinballWizard.Cli -- --help shows the new flag in the output. ✅
  5. Tests assert behavior — re-run idempotency test pre-populates TotalDocumentsDiscovered = 1234, TotalRunFailures = 7, etc., and explicitly asserts each survives. Duplicate-ids test feeds duplicates. Cancellation test pre-cancels the token. ✅
  6. Build is zero-warningdotnet build PinballWizard.slnx --nologo → 0 warnings under TreatWarningsAsErrors. ✅
  7. Identity checkgit log -1 --format='%an <%ae>'Jim Keeley <94459922+jkeeley2073@users.noreply.github.com>

Out of Scope

  • Operational hand-off (running the seeder against real / deployed Cosmos) — happens after merge as a small ops task, not in this PR. Captured against Phase 2 § Exit criteria in the phase retrospective.
  • Item 4 (OPDB sync against deployed Cosmos with --dry-run pre-req) — Wave 2; a separate PR.
  • Items 5, 6, 7-round-2, 8 — sequenced per Phase 2 § Parallelism plan.

🤖 Generated with Claude Code

…smos

Phase 2 § Scope item 3 (Wave 1). New CLI flag bootstraps the
ingestion_sources Cosmos container from a checked-in JSON manifest, with
idempotent read-merge-upsert semantics that preserve runtime fields
populated by actual scraper runs (LastRunAt, LastSuccessAt,
totalDocumentsDiscovered, totalRunFailures).

What ships:

- src/PinballWizard.Application/Sync/IIngestionSourceSeeder.cs +
  IngestionSourceSeeder.cs + IngestionSourceSeed.cs — Application-layer
  service taking a manifest path, upserting one row per entry. DTO
  intentionally omits runtime fields (defense-in-depth: even if a
  manifest accidentally carried them, only config fields would apply).
- data/seeds/ingestion_sources.v1.json — 9 entries (stern, jjp, ap,
  spooky, pinballbrothers, barrelsoffun, multimorphic, cgc, opdb) with
  cadences from project_phase2_architecture_decisions.md.
- src/PinballWizard.Cli/Program.cs — wires --seed-ingestion-sources
  flag with the same exit-code-2-when-Cosmos-not-configured pattern as
  --ensure-cosmos-containers. DI registration of IIngestionSourceSeeder
  is gated alongside AddCosmosPersistence so it cannot resolve without
  its repository.
- .gitignore — adjusts data/ pattern to data/* with !data/seeds/
  re-include so the seed manifest is tracked while runtime output
  (downloads, metadata, logs, snapshots, history) stays ignored. The
  comment block explains why the pattern is data/* not data/ (git
  cannot re-include children of an excluded parent directory).
- tests/PinballWizard.Scraper.Tests/Sync/IngestionSourceSeederTests.cs
  — 8 tests: first-run inserts with zero runtime fields; re-run applies
  config and preserves runtime fields (load-bearing); duplicate ids
  throw and roll back; missing manifest throws FileNotFoundException;
  empty array returns zero counts; malformed JSON throws
  InvalidOperationException; pre-cancelled token throws
  OperationCanceledException without upserting; production manifest
  on disk deserializes and pins the 9 expected ids.

Local review: 1 🔴 finding (fixed); 3 ⚠️ categories (2 sub-findings
fixed, 3 deferred with justifications below).

🔴 — Fixed: data/seeds/ingestion_sources.v1.json originally used
"chicagogaming" for both id and scraperImplKey, but the canonical CGC
key throughout the codebase is "cgc" (ScraperManufacturerKey,
OpdbMachineMapper normalization, ScraperOrchestrator.SourceAliases,
existing --source cgc CLI filter). Manifest + test fixture both
updated.

⚠️ — Fixed: cancellation propagation test added
(SeedAsync_PreCancelledToken_ThrowsOperationCanceledExceptionWithoutUpserting);
terminal LogInformation summary added to the seeder for sibling
consistency with OpdbSyncService.

⚠️ — Deferred (with justifications):
- Per-record try/catch on UpsertAsync failures: deferred. Sibling
  OpdbSyncService has the same shape; the manifest is 9 rows and the
  seeder is idempotent, so a transient mid-loop failure is recovered
  by re-run. Worth revisiting if the manifest grows substantially or
  if Phase 6 operability work warrants partial-success counters.
- Total field naming nuance (means "manifest count" not "processed
  count"): deferred. The early-throw path doesn't return a result,
  so the distinction is academic; the field is documented inline.
- ProductionManifest_DeserializesCleanlyAndContainsNineEntries
  doesn't follow the SeedAsync_<state>_<expectation> naming pattern:
  deferred. The subject is the on-disk file, not a method on the
  seeder; a Method_State_Expectation rename would obscure that.

Full test run: 515 / 515 passing (was 507 pre-PR; +8 seeder tests).
Build clean, zero warnings.
@jkeeley2073 jkeeley2073 added the claude-code Generated with Claude Code label May 4, 2026
Seed("jjp", "Jersey Jack", "jjp", "https://www.jerseyjackpinball.com/", true, "daily"));

_repo.GetByIdAsync(Arg.Any<string>(), "config", Arg.Any<CancellationToken>())
.Returns((IngestionSource?)null);
Seed("stern", "Stern Pinball", "stern", "https://sternpinball.com/", true, "daily"));

_repo.GetByIdAsync(Arg.Any<string>(), "config", Arg.Any<CancellationToken>())
.Returns((IngestionSource?)null);
return;
}

var manifestPath = Path.Combine("data", "seeds", "ingestion_sources.v1.json");
e.Id == "stern"
&& e.PartitionKey == "config"
&& e.DisplayName == "Stern Pinball"
&& e.Enabled == true
[Fact]
public async Task SeedAsync_MissingManifestFile_ThrowsFileNotFoundException()
{
var nonexistent = Path.Combine(Path.GetTempPath(), $"missing-{Guid.NewGuid():N}.json");
public void ProductionManifest_DeserializesCleanlyAndContainsNineEntries()
{
var repoRoot = FindRepoRoot();
var manifestPath = Path.Combine(repoRoot, "data", "seeds", "ingestion_sources.v1.json");

private string WriteRawManifest(string json)
{
var path = Path.Combine(Path.GetTempPath(), $"seed-{Guid.NewGuid():N}.json");
{
// Walk upward from the test assembly until we find the .slnx file.
var dir = new DirectoryInfo(AppContext.BaseDirectory);
while (dir is not null && !File.Exists(Path.Combine(dir.FullName, "PinballWizard.slnx")))
e.Id == "stern"
&& e.PartitionKey == "config"
&& e.DisplayName == "Stern Pinball"
&& e.Enabled == true
[Fact]
public async Task SeedAsync_MissingManifestFile_ThrowsFileNotFoundException()
{
var nonexistent = Path.Combine(Path.GetTempPath(), $"missing-{Guid.NewGuid():N}.json");
public void ProductionManifest_DeserializesCleanlyAndContainsNineEntries()
{
var repoRoot = FindRepoRoot();
var manifestPath = Path.Combine(repoRoot, "data", "seeds", "ingestion_sources.v1.json");

private string WriteRawManifest(string json)
{
var path = Path.Combine(Path.GetTempPath(), $"seed-{Guid.NewGuid():N}.json");
{
// Walk upward from the test assembly until we find the .slnx file.
var dir = new DirectoryInfo(AppContext.BaseDirectory);
while (dir is not null && !File.Exists(Path.Combine(dir.FullName, "PinballWizard.slnx")))
@jkeeley2073 jkeeley2073 merged commit e10f0d9 into main May 4, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claude-code Generated with Claude Code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants