feat(observability) OTel groundwork — Meter, ActivitySource, OPDB metrics#73

Merged: jkeeley2073 merged 1 commit into main from Dev-OtelGroundwork on May 4, 2026

Summary

Closes docs/build-spec.md Phase 2 § Scope item 5 (Wave 3, last Phase 2 scope item).

Establishes the project-wide OpenTelemetry instrumentation pattern that Phase 3 / 4 / 5 services inherit, and applies it to OpdbSyncService as the first real consumer.

What ships

  • PinballWizardTelemetry — single project-wide Meter + ActivitySource (both named "PinballWizard"). New scrapers / services add their counters and activities here under the pinwiz.<domain>.<operation>.<measure> naming convention rather than creating per-domain Meters.
  • 6 OPDB sync instruments — counters for fetched, inserted, updated, skipped, failed; histogram for duration_ms. Each carries a pinwiz.opdb.sync.mode attribute (apply | dry_run) so dashboards filter operational charts to apply-only and pre-deploy-validation charts to dry-run-only.
  • 1 ActivitySource activity (pinwiz.opdb.sync) with run-summary tags. Trace inspection alone tells the run's full story without joining against the metric stream.
  • IIngestionSourceRepository.RecordRunResultAsync + new IngestionSourceRunResult DTO — per-run state write-back (LastRunAt, LastSuccessAt, accumulators). No-ops with a logged warning if the source isn't seeded.
  • IngestionSourceIds — single source of truth for the source-id literals at write-back call sites and in the seed manifest. Phase 3 scrapers add constants here.
  • OpdbSyncService instrumented in a try/catch/finally: catch increments OpdbSyncFailed and sets ActivityStatusCode.Error; finally emits all per-run observations and writes back to ingestion_sources (apply-mode only; cancellation skips the write-back intentionally).
  • ServiceDefaults/Extensions.cs registers AddMeter("PinballWizard") + AddSource("PinballWizard"). The string literal is duplicated rather than typed-referenced to avoid a ServiceDefaults → Application project reference (would invert layering); duplication is documented in both files + observability.md.
  • docs/observability.md — full instrument inventory, Activity inventory, write-back semantics, consumer guides (Aspire dashboard / Log Analytics / Application Insights), standard tag conventions, Phase 3/4/5 extension pattern, deferred work (Cosmos RU charge → Phase 6).
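
A minimal sketch of the single-Meter / single-ActivitySource pattern described in the list above. The class name `TelemetrySketch`, its member names, and the full instrument names (for example `pinwiz.opdb.sync.fetched`) are assumptions derived from the stated naming convention, not copied from the PR; only the `"PinballWizard"` name, the `pinwiz.opdb.sync.mode` attribute, and the apply / dry_run values come from the description.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Usage: record one fetched document for an apply-mode run.
var mode = new KeyValuePair<string, object?>(TelemetrySketch.ModeTagName, "apply");
TelemetrySketch.OpdbSyncFetched.Add(1, mode);
Console.WriteLine(TelemetrySketch.OpdbSyncFetched.Name); // prints "pinwiz.opdb.sync.fetched"

// Hypothetical stand-in for PinballWizardTelemetry (member names are assumptions).
public static class TelemetrySketch
{
    // Shared by the Meter and the ActivitySource, as the PR describes.
    public const string Name = "PinballWizard";
    public const string ModeTagName = "pinwiz.opdb.sync.mode";

    public static readonly Meter SharedMeter = new(Name);
    public static readonly ActivitySource SharedSource = new(Name);

    // Instrument names follow pinwiz.<domain>.<operation>.<measure>.
    public static readonly Counter<long> OpdbSyncFetched =
        SharedMeter.CreateCounter<long>("pinwiz.opdb.sync.fetched");
    public static readonly Counter<long> OpdbSyncFailed =
        SharedMeter.CreateCounter<long>("pinwiz.opdb.sync.failed");
    public static readonly Histogram<double> OpdbSyncDurationMs =
        SharedMeter.CreateHistogram<double>("pinwiz.opdb.sync.duration_ms", unit: "ms");
}
```

A new scraper would add its own static instrument fields to the same class under this scheme, so the host only ever needs to subscribe to one meter name and one source name.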

Test Plan

  • dotnet test PinballWizard.slnx --nologo → 533 / 533 passing (was 524; +9 new tests):
    • 5 in PinballWizardTelemetryTests pinning instrument names + units (catches dashboard-breaking renames at build time)
    • 4 in OpdbSyncServiceTests: apply-mode records run result with DocumentsDiscovered; dry-run skips write-back; write-back failure doesn't mask original sync success; failure-path rethrows with Succeeded=false recorded
  • dotnet build PinballWizard.slnx --nologo → clean, zero warnings
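
One way a name-pinning check like the ones listed above could look, using only the BCL `MeterListener`. This is a sketch under assumptions: the meter name `PinballWizard.Sketch`, the instrument names, and the `Check` helper are illustrative, and the real tests presumably use the project's test framework rather than a console program.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Illustrative instruments; the names are assumed from the naming convention.
using var meter = new Meter("PinballWizard.Sketch");
var fetched = meter.CreateCounter<long>("pinwiz.opdb.sync.fetched");
var duration = meter.CreateHistogram<double>("pinwiz.opdb.sync.duration_ms", unit: "ms");

// Capture every long measurement published by the sketch meter.
var seen = new List<(string Name, long Value, string? Mode)>();
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "PinballWizard.Sketch")
    {
        l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    string? modeValue = null;
    foreach (var tag in tags)
    {
        if (tag.Key == "pinwiz.opdb.sync.mode") modeValue = tag.Value as string;
    }
    seen.Add((instrument.Name, value, modeValue));
});
listener.Start();

fetched.Add(3, new KeyValuePair<string, object?>("pinwiz.opdb.sync.mode", "dry_run"));

// Pin the name and unit so a rename fails here instead of in a dashboard.
Check(fetched.Name == "pinwiz.opdb.sync.fetched", "counter name");
Check(duration.Unit == "ms", "histogram unit");
Check(seen.Count == 1 && seen[0] == ("pinwiz.opdb.sync.fetched", 3L, "dry_run"), "measurement and mode tag");
Console.WriteLine("pins hold");

static void Check(bool condition, string what)
{
    if (!condition) throw new InvalidOperationException($"pin broken: {what}");
}
```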

Local review

/local-review outcome: 1 🔴 / 5 ⚠️ / 5 categories ✅.

  • 🔴 Fixed: missing failure-path test. Added SyncAsync_ApplyMode_RepositoryThrows_RecordsFailedRunResultThenRethrows, which uses NSubstitute.ThrowsAsync to make GetByOpdbIdAsync throw, asserts that RecordRunResultAsync was called with Succeeded=false despite the exception, and verifies that the exception propagates. This path is load-bearing because operators look at the dashboard precisely when runs are failing.
  • ⚠️ Fixed: IngestionSourceIds.Opdb const replaces the magic string "opdb" in OpdbSyncService; Phase 3 scrapers add their constants here.
  • ⚠️ Fixed: cancellation comment expanded to make the trade-off explicit (a cancelled run skips write-back; remediation noted inline).
  • ⚠️ Fixed: docs/observability.md footnote added, noting that the destination Log Analytics table differs (AppMetrics vs customMetrics) depending on the OTLP ingestion path.
  • ⚠️ Deferred: sibling-drift note about per-service DocumentsDiscovered math; already covered by observability.md § "Adding new instruments".
  • ⚠️ Deferred: drift test for ServiceDefaults literal vs PinballWizardTelemetry constant; a comment documents the duplication in 3 places already; accepted risk.

7-item self-audit

  1. Every option field is read — N/A (no new options)
  2. Sibling-diff — diffed against IngestionSourceSeeder (the closest sibling); both use read-merge-upsert against IIngestionSourceRepository with consistent error semantics
  3. No bare catch { } — all catches are scoped (OperationCanceledException when ..., Exception ex with logging)
  4. CLI / orchestrator wiring — no CLI changes; existing --source opdb already dispatches through IOpdbSyncService which now resolves with the new constructor parameter via DI (verified — AddCosmosPersistence registers IIngestionSourceRepository)
  5. Tests assert behavior — failure-path test exercises real exception propagation; apply-mode test verifies actual RecordRunResultAsync call args; dry-run test verifies DidNotReceive
  6. Build is zero-warning — verified
  7. Identity check: git log -1 --format='%an <%ae>' → Jim Keeley <94459922+jkeeley2073@users.noreply.github.com>

Out of Scope

  • Cosmos RU charge capture (pinwiz.cosmos.write.ru_charge) — deferred to Phase 6 (operability) per observability.md § "Deferred to later phases". Best designed against real production traffic.
  • Per-scraper instrumentation (pinwiz.scrape.<source>.*) — Phase 3+, when manufacturer scrapers gain ACA Job execution.
  • Direct IngestionSourceRepository.RecordRunResultAsync repository tests — would require Cosmos Container mocking (~40 lines per case); deferred. The OpdbSyncService tests verify the contract is invoked correctly, which exercises the same logic at the integration boundary.
  • Phase 2 closeout — Phase 2 § Scope items 1–9 are now all complete with this PR. The phase retrospective + memory update happen in a follow-up after this merges.

🤖 Generated with Claude Code

…rics

Phase 2 § Scope item 5 (Wave 3, last Phase 2 scope item). Establishes
the project-wide OpenTelemetry instrumentation pattern that Phase 3 / 4 /
5 services inherit, and applies it to OpdbSyncService as the first
real consumer.

What ships:

- src/PinballWizard.Application/Observability/PinballWizardTelemetry.cs —
  single project-wide Meter ("PinballWizard") and ActivitySource
  ("PinballWizard"). All instruments live here under the
  pinwiz.<domain>.<operation>.<measure> naming convention. New scrapers
  / services add their counters/activities alongside the existing OPDB
  set rather than creating per-domain Meters.
- 6 OPDB sync instruments: Counter<long> for fetched / inserted /
  updated / skipped / failed; Histogram<double> for duration_ms. All
  carry a `pinwiz.opdb.sync.mode` attribute (apply | dry_run) so
  dashboards filter operational charts to apply-only and validation
  charts to dry-run-only.
- 1 ActivitySource activity name (pinwiz.opdb.sync) with run-summary
  tags for fetched / inserted / updated / skipped / duration_ms.
- IIngestionSourceRepository.RecordRunResultAsync(sourceId, result, ct)
  + IngestionSourceRunResult DTO. Updates LastRunAt; sets
  LastSuccessAt on success (preserves on failure); accumulates
  TotalDocumentsDiscovered; increments TotalRunFailures on failure.
  No-ops with a logged warning if the source isn't seeded yet.
- IngestionSourceIds static class — single source of truth for the
  source-id literals used in the seed manifest and at write-back
  call sites. Phase 3 manufacturer scrapers add constants here.
- OpdbSyncService instrumented in a try/catch/finally:
  - try block runs the existing fetch-and-upsert loop unchanged
  - catch block increments OpdbSyncFailed counter, sets the
    Activity status to ActivityStatusCode.Error, and rethrows
  - finally block emits all per-run counter/histogram observations
    AND calls RecordRunResultAsync (apply-mode only; dry-run skips
    operator-visible state). Cancellation skips the write-back so
    operator-driven Ctrl-C / ACA shutdown doesn't pollute failure
    counters; documented in code.
- ServiceDefaults/Extensions.cs registers AddMeter("PinballWizard") +
  AddSource("PinballWizard") with OTel. The string literal is
  duplicated rather than typed-referenced to avoid a ServiceDefaults →
  Application project reference (would invert layering); duplication
  documented in both files + docs/observability.md.
- docs/observability.md — full instrument inventory; Activity inventory;
  IngestionSource write-back semantics; how to consume in Aspire
  dashboard / Log Analytics / Application Insights; standard tag
  conventions; Phase 3/4/5 extension pattern; deferred work
  (Cosmos RU charge capture → Phase 6).
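
The write-back semantics listed above for RecordRunResultAsync can be sketched as a pure merge function. This is a hedged illustration, not the repository's code: the record shapes, the `CompletedAt` field, and the `RunResultMerge` helper are hypothetical, and it assumes DocumentsDiscovered accumulates on every run (the description does not say whether failed runs are excluded).

```csharp
#nullable enable
using System;

var t1 = new DateTimeOffset(2026, 5, 4, 12, 0, 0, TimeSpan.Zero);
var seeded = new SourceState(null, null, 0, 0);

// One successful run followed by one failed run an hour later.
var afterSuccess = RunResultMerge.Apply(seeded, new RunResult(t1, Succeeded: true, DocumentsDiscovered: 10));
var afterFailure = RunResultMerge.Apply(afterSuccess, new RunResult(t1.AddHours(1), Succeeded: false, DocumentsDiscovered: 2));

Console.WriteLine(afterFailure!.TotalRunFailures);             // prints 1
Console.WriteLine(afterFailure.LastSuccessAt == t1);           // prints True: preserved on failure
Console.WriteLine(RunResultMerge.Apply(null, new RunResult(t1, true, 1)) is null); // prints True: unseeded no-op

// Hypothetical shapes; only the property names taken from the PR text are real.
public sealed record SourceState(
    DateTimeOffset? LastRunAt,
    DateTimeOffset? LastSuccessAt,
    long TotalDocumentsDiscovered,
    long TotalRunFailures);

public sealed record RunResult(DateTimeOffset CompletedAt, bool Succeeded, int DocumentsDiscovered);

public static class RunResultMerge
{
    public static SourceState? Apply(SourceState? current, RunResult result)
    {
        if (current is null)
        {
            // Mirrors "no-ops with a logged warning if the source isn't seeded".
            Console.Error.WriteLine("warning: ingestion source not seeded; run result dropped");
            return null;
        }

        return current with
        {
            LastRunAt = result.CompletedAt,
            LastSuccessAt = result.Succeeded ? result.CompletedAt : current.LastSuccessAt,
            TotalDocumentsDiscovered = current.TotalDocumentsDiscovered + result.DocumentsDiscovered,
            TotalRunFailures = current.TotalRunFailures + (result.Succeeded ? 0 : 1),
        };
    }
}
```

Keeping the merge free of I/O like this would let the accumulation rules be checked without mocking a Cosmos container, although the PR itself defers direct repository tests.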

Local review summary: 1 🔴 / 5 ⚠️ / 5 categories ✅.

🔴 — Fixed: missing failure-path test. Added
SyncAsync_ApplyMode_RepositoryThrows_RecordsFailedRunResultThenRethrows
which uses NSubstitute.ThrowsAsync to make the GetByOpdbIdAsync call
throw, and asserts that RecordRunResultAsync was called with
Succeeded=false despite the exception, plus verifies the exception
propagates. This is the load-bearing path — operators look at the
dashboard precisely when runs are failing.

⚠️ — Fixed:
- IngestionSourceIds.Opdb const replaces magic-string "opdb" in
  OpdbSyncService; Phase 3 scrapers add constants here, single source
  of truth for the seed-manifest / write-back contract.
- Cancellation comment expanded in OpdbSyncService.cs to make the
  trade-off explicit (cancelled run skips write-back; intentional;
  remediation noted inline if a future operator wants cancelled runs
  recorded as failures).
- docs/observability.md — added a footnote documenting that the
  destination Log Analytics table differs (AppMetrics vs customMetrics)
  depending on OTLP ingestion path (direct LA vs Application Insights).

⚠️ — Deferred:
- Sibling-drift "DocumentsDiscovered = inserted + updated" math note —
  observability.md § "Adding new instruments" already covers per-service
  accounting; explicit OPDB-specific note would add noise without
  value.
- Drift test for ServiceDefaults literal vs PinballWizardTelemetry
  constant — comment already documents the duplication in 3 places
  (PinballWizardTelemetry.cs, Extensions.cs, observability.md);
  accepted risk.

Tests: 533 / 533 passing (was 524; +9: 5 telemetry name pins + 4
OpdbSyncService instrumentation behavior tests). Build clean, zero
warnings.
jkeeley2073 added the claude-code label on May 4, 2026
Comment on lines +182 to +190
catch (Exception writeBackEx)
{
    _logger.LogError(
        writeBackEx,
        "OPDB sync completed{State} but recording the run result on " +
        "ingestion_sources failed; the source's lastRunAt / counters may " +
        "lag by one run.",
        failure is null ? string.Empty : " with errors");
}
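
For context, a sketch of how a catch like the one quoted above might sit inside the try/catch/finally shape the PR describes. Everything here is illustrative: `SyncSketch`, the delegate-based write-back, the `PinballWizard.Sketch` names, and the use of `CancellationToken.None` for the write-back are assumptions, and the real service logs through `ILogger` rather than the console.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;
using System.Threading;
using System.Threading.Tasks;

var recorded = new List<bool>();
var sync = new SyncSketch((succeeded, ct) => { recorded.Add(succeeded); return Task.CompletedTask; });

await sync.RunAsync(apply: true, ct => Task.FromResult(3), CancellationToken.None);
try { await sync.RunAsync(apply: true, ct => throw new InvalidOperationException("boom"), CancellationToken.None); }
catch (InvalidOperationException) { /* the original failure still propagates */ }
await sync.RunAsync(apply: false, ct => Task.FromResult(3), CancellationToken.None); // dry run: no write-back

Console.WriteLine(string.Join(",", recorded)); // prints "True,False"

public sealed class SyncSketch
{
    private static readonly Meter SketchMeter = new("PinballWizard.Sketch");
    private static readonly ActivitySource SketchSource = new("PinballWizard.Sketch");
    private static readonly Counter<long> Fetched = SketchMeter.CreateCounter<long>("pinwiz.opdb.sync.fetched");
    private static readonly Counter<long> Failed = SketchMeter.CreateCounter<long>("pinwiz.opdb.sync.failed");
    private static readonly Histogram<double> DurationMs =
        SketchMeter.CreateHistogram<double>("pinwiz.opdb.sync.duration_ms", unit: "ms");

    private readonly Func<bool, CancellationToken, Task> _recordRunResult;

    public SyncSketch(Func<bool, CancellationToken, Task> recordRunResult) => _recordRunResult = recordRunResult;

    public async Task<int> RunAsync(bool apply, Func<CancellationToken, Task<int>> work, CancellationToken ct)
    {
        var mode = new KeyValuePair<string, object?>("pinwiz.opdb.sync.mode", apply ? "apply" : "dry_run");
        using var activity = SketchSource.StartActivity("pinwiz.opdb.sync"); // null when nothing listens
        var stopwatch = Stopwatch.StartNew();
        Exception? failure = null;
        var cancelled = false;
        var fetched = 0;
        try
        {
            fetched = await work(ct);
            return fetched;
        }
        catch (OperationCanceledException) when (ct.IsCancellationRequested)
        {
            cancelled = true; // cancellation is not counted as a failure
            throw;
        }
        catch (Exception ex)
        {
            failure = ex;
            Failed.Add(1, mode);
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
        finally
        {
            // Per-run observations are emitted on every path.
            Fetched.Add(fetched, mode);
            DurationMs.Record(stopwatch.Elapsed.TotalMilliseconds, mode);
            if (apply && !cancelled)
            {
                try
                {
                    await _recordRunResult(failure is null, CancellationToken.None);
                }
                catch (Exception writeBackEx)
                {
                    // Swallowed so a write-back failure cannot mask the sync outcome.
                    Console.Error.WriteLine($"run-result write-back failed: {writeBackEx.Message}");
                }
            }
        }
    }
}
```

The inner try/catch is what keeps a failed write-back from replacing either a successful return value or the original exception.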
    MachineJson("GRBN-MQR4P", manufacturer: "Stern Pinball, Inc.", name: "Stranger Things (Pro)", commonName: "Stranger Things"),
    MachineJson("XYZ", manufacturer: "Jersey Jack Pinball", name: "Wonka", commonName: "Wonka")));

_repository.GetByOpdbIdAsync(Arg.Any<string>(), Arg.Any<string>(), Arg.Any<CancellationToken>()).Returns((Machine?)null);
_handler.SetResponseFor("/api/machines?page=1&page_size=100", JsonArray(
    MachineJson("GRBN-MQR4P", manufacturer: "Stern Pinball, Inc.", name: "Stranger Things (Pro)", commonName: "Stranger Things")));

_repository.GetByOpdbIdAsync(Arg.Any<string>(), Arg.Any<string>(), Arg.Any<CancellationToken>()).Returns((Machine?)null);
jkeeley2073 merged commit c211b07 into main on May 4, 2026
5 checks passed