feat(cosmos) PR 2: CosmosClientOptions wins + warmup + health check + ApplicationName per host #140

Merged: jkeeley2073 merged 1 commit into main from Dev-Phase4Adr0025Pr2ClientOptionsHostStartup on May 9, 2026

Conversation

@jkeeley2073
Contributor

Summary

PR 2 of 6 in the Cosmos for User Delight track per ADR-0025. PR 1 (#139) shipped the ADR + 5-layer enforcement scaffolding; this PR implements the locked CosmosClientOptions posture + host-startup wiring. No behavioral change to user-facing query paths — PR 5 is the user-delight headline change.

What lands

CosmosClientOptions changes (per ADR-0025 § 2):

  • EnableContentResponseOnWrite = false — saves one round-trip + ~1 RU per write. Required pre-flight: refactored CosmosRepository<T>.UpsertAsync to return the input entity directly. Audited every IRepository<T>.UpsertAsync caller — none consume response.Resource. Updated IRepository<T> XML doc + CosmosRepositoryTests for the new contract.
  • AllowBulkExecution = true — auto-batches concurrent same-partition operations. Zero risk for current single-op call sites; meaningful win for OPDB sync (~2,400 sequential upserts) and the future Phase 1 → Cosmos backfill.
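
The two options above, and the UpsertAsync contract they force, could look roughly like the following. This is a minimal sketch and not the PR's actual code: `CosmosOptionsSketch`, `RepositorySketch<T>`, and the `applicationName` parameter are hypothetical names, while the option properties and `UpsertItemAsync` come from the `Microsoft.Azure.Cosmos` v3 SDK.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Hypothetical helper; the PR's real DI registration is not shown in this description.
public static class CosmosOptionsSketch
{
    public static CosmosClientOptions Build(string? applicationName)
    {
        var options = new CosmosClientOptions
        {
            // Write responses carry no item body, so response.Resource is null.
            EnableContentResponseOnWrite = false,
            // Lets the SDK group concurrent operations into batched requests.
            AllowBulkExecution = true,
        };

        // Assign only a real value; the PR notes that the SDK rejects empty strings.
        if (!string.IsNullOrWhiteSpace(applicationName))
        {
            options.ApplicationName = applicationName;
        }

        return options;
    }
}

// Hypothetical repository illustrating the "return the input entity" contract.
public sealed class RepositorySketch<T> where T : class
{
    private readonly Container _container;

    public RepositorySketch(Container container) => _container = container;

    public async Task<T> UpsertAsync(T entity, string partitionKey, CancellationToken ct = default)
    {
        // The response body is ignored on purpose: with content responses
        // disabled there is nothing useful in response.Resource.
        await _container.UpsertItemAsync(entity, new PartitionKey(partitionKey), cancellationToken: ct);
        return entity;
    }
}
```

Bulk mode can only batch operations that are in flight at the same time (for example, tasks gathered with `Task.WhenAll`), so a loop that awaits each upsert in turn would likely need to be made concurrent before it benefits.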

Hot-path query tuning:

  • MachineRepository.QueryByTitleAsync refactored to use Container.GetItemQueryIterator directly with MaxItemCount = 1 (the caller breaks on first match). Comment cross-references ADR-0025 § 4 bridge state — PR 5 replaces the cross-partition query with a point-read.
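
A first-match query of the kind described above might be sketched as follows. The `MachineSketch` type, the `title` property, and the query text are assumptions made for illustration; only `GetItemQueryIterator`, `QueryDefinition`, `QueryRequestOptions.MaxItemCount`, and `FeedIterator<T>` are taken from the `Microsoft.Azure.Cosmos` v3 SDK.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Hypothetical document shape; the PR's real machine type is not shown.
public sealed class MachineSketch
{
    public string id { get; set; } = string.Empty;
    public string title { get; set; } = string.Empty;
}

public static class TitleQuerySketch
{
    public static async Task<MachineSketch?> FindFirstByTitleAsync(
        Container container, string title, CancellationToken ct = default)
    {
        // Assumed query text; the parameter avoids string concatenation.
        var query = new QueryDefinition("SELECT * FROM c WHERE c.title = @title")
            .WithParameter("@title", title);

        // Cap each page at one item because only the first match is used.
        var requestOptions = new QueryRequestOptions { MaxItemCount = 1 };

        using FeedIterator<MachineSketch> iterator =
            container.GetItemQueryIterator<MachineSketch>(query, requestOptions: requestOptions);

        // Cross-partition queries can return empty pages, so keep reading
        // until an item appears or the iterator is exhausted.
        while (iterator.HasMoreResults)
        {
            FeedResponse<MachineSketch> page = await iterator.ReadNextAsync(ct);
            foreach (MachineSketch machine in page)
            {
                return machine; // stop on the first match
            }
        }

        return null;
    }
}
```

The query still fans out across partitions; the page-size cap only limits how much each round trip returns, which is why the description treats it as a bridge until the point-read in PR 5.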

Host-startup posture:

  • New CosmosClientWarmupHostedService (BackgroundService) that calls CosmosClient.ReadAccountAsync() at host startup to amortize the SDK's lazy-connection cost (~300-500ms) off the first user query.
  • New CosmosHealthCheck (IHealthCheck) that probes the machines container as a canary. Tagged live so /healthz reports Cosmos reachability for ACA / Aspire liveness probes. On CosmosException, captures Diagnostics into health-check data (region, retry count, RU consumed) so operators see context without a separate trace lookup.
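
The two host-startup pieces above could be sketched like this. `WarmupSketch` and `HealthCheckSketch` are hypothetical stand-ins for the PR's classes, and the data keys are illustrative, not the keys the PR actually emits.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public sealed class WarmupSketch : BackgroundService
{
    private readonly CosmosClient _client;
    private readonly ILogger<WarmupSketch> _logger;

    public WarmupSketch(CosmosClient client, ILogger<WarmupSketch> logger)
    {
        _client = client ?? throw new ArgumentNullException(nameof(client));
        _logger = logger ?? throw new ArgumentNullException(nameof(logger));
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            // Any cheap account-level call forces the first connection.
            await _client.ReadAccountAsync();
            _logger.LogInformation("Cosmos warmup took {DurationMs:F0}ms.", stopwatch.Elapsed.TotalMilliseconds);
        }
        catch (Exception ex)
        {
            // Warmup is a latency optimization, so failure is logged, not rethrown.
            _logger.LogWarning(ex, "Cosmos warmup failed after {DurationMs:F0}ms.", stopwatch.Elapsed.TotalMilliseconds);
        }
    }
}

public sealed class HealthCheckSketch : IHealthCheck
{
    private readonly Container _machines;

    public HealthCheckSketch(Container machines) =>
        _machines = machines ?? throw new ArgumentNullException(nameof(machines));

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            await _machines.ReadContainerAsync(cancellationToken: cancellationToken);
            return HealthCheckResult.Healthy("Cosmos reachable");
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            throw; // let cancellation propagate instead of reporting Unhealthy
        }
        catch (CosmosException ex)
        {
            // Illustrative keys only; the diagnostics string carries the detail.
            var data = new Dictionary<string, object>
            {
                ["statusCode"] = (int)ex.StatusCode,
                ["requestCharge"] = ex.RequestCharge,
                ["diagnostics"] = ex.Diagnostics?.ToString() ?? string.Empty,
            };
            return HealthCheckResult.Unhealthy("Cosmos request failed", ex, data);
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy($"Cosmos unreachable: {ex.Message}", exception: ex);
        }
    }
}
```

Catching the SDK exception before the general one keeps the diagnostics attached to the health report, while the general catch ensures a probe never crashes the host.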

Per-host identity:

  • Cosmos:ApplicationName = pinwiz-cli (CLI) and pinwiz-rag-worker (RagIngestionWorker). Distinguishes hosts in Cosmos diagnostics + custom-metric tagging.

Package add:

  • Microsoft.Extensions.Diagnostics.HealthChecks 10.0.7 — narrower than taking a Microsoft.AspNetCore.App framework reference; gives just the abstractions + AddCheck<T> extension the Infrastructure layer needs.
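
With only that package referenced, a tagged registration might look like the sketch below. `PlaceholderCheck` stands in for the PR's CosmosHealthCheck, and the extension method name and the check name "cosmos" are hypothetical; `AddHealthChecks` and `AddCheck<T>` are the package's own extensions.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Stand-in for the real health check so the sketch compiles on its own.
public sealed class PlaceholderCheck : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default) =>
        Task.FromResult(HealthCheckResult.Healthy());
}

public static class HealthCheckRegistrationSketch
{
    public static IServiceCollection AddCosmosHealthCheckSketch(this IServiceCollection services)
    {
        // The "live" tag lets a liveness endpoint filter for this check.
        services.AddHealthChecks()
            .AddCheck<PlaceholderCheck>("cosmos", tags: new[] { "live" });
        return services;
    }
}
```

How the `/healthz` endpoint filters on the tag is host-specific and is not shown here.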

Tests (979 → 987, +8)

  • New CosmosClientOptionsTests (2 tests) — pins all 6 ADR-0025 § 2 client options via real DI registration (catches wiring drift, not just config drift). Includes the null-safe ApplicationName path pinning the SDK constraint that empty strings are rejected.
  • New CosmosHealthCheckTests (6 tests) — healthy / CosmosException with diagnostics data keys / generic exception / cancellation propagation / 2 ctor null guards.
  • Updated CosmosRepositoryTests.UpsertAsync_PassesEntityAndPartitionKey_ReturnsInputEntity — asserts Assert.Same(entity, result) AND null ETag to prove the new contract.

Pre-push self-audit

  • Qualitative review (/local-review): ✅ 0 🔴 / 0 ⚠️ / 7 categories ✅ — design / test quality / failure posture / cross-references / drift / Cosmos surface conformance / package hygiene all pass
  • 8-item mechanical audit (now includes Cosmos surface conformance #8 from PR 1): ✅ all 8 items pass: option fields read, no sibling drift, no bare catches, end-to-end DI wiring, behavior tests, zero warnings, personal identity, ADR-0025 conformance verified

Test plan

  • dotnet build PinballWizard.slnx -p:TreatWarningsAsErrors=true — 0/0 warnings
  • dotnet test PinballWizard.slnx — 987/987 passing
  • Identity check: personal noreply
  • Pre-flight audit: no IRepository<T>.UpsertAsync callers consume response.Resource
  • Operator hand-off (post-merge): verify Cosmos:ApplicationName shows up in Cosmos diagnostic logs as pinwiz-rag-worker once the Container App image is rebuilt + swapped

Track progress

  • ✅ PR 1 (#139): ADR-0025 + 5-layer enforcement scaffolding
  • ✅ PR 2 (this PR): CosmosClientOptions wins + warmup + health check + ApplicationName per host
  • ⏳ PR 3: Selective indexing policies + drift-check
  • ⏳ PR 6: TTL on rag_dead_letters
  • ⏳ PR 4: pinwiz.cosmos.* instruments + MeteredCosmosRepository<T> decorator (gates PR 5)
  • ⏳ PR 5: Title→OpdbId point-read lookup container (the user-delight headline change)

🤖 Generated with Claude Code

… ApplicationName per host

PR 2 of 6 in the Cosmos for User Delight track per ADR-0025
(plan: ~/.claude/plans/lets-take-some-time-ticklish-storm.md).
PR 1 (#139) shipped the ADR + 5-layer enforcement scaffolding;
this PR implements the locked CosmosClientOptions posture +
host-startup wiring (no behavioral change to user-facing query paths;
PR 5 is the user-delight headline change).

What lands:

- CosmosClientOptions changes (per ADR-0025 § 2):
  - EnableContentResponseOnWrite = false — saves one round-trip + ~1 RU
    per write. Required pre-flight: refactored CosmosRepository<T>.UpsertAsync
    to return the input entity directly instead of response.Resource
    (which is null when the option is off). Audited every IRepository<T>.UpsertAsync
    caller — none consume response.Resource. Updated IRepository<T>
    XML doc + the affected CosmosRepositoryTests case to reflect the
    new contract.
  - AllowBulkExecution = true — auto-batches concurrent same-partition
    operations. Zero risk for current single-op call sites; meaningful
    win for OPDB sync (~2,400 sequential upserts) and the future
    Phase 1 → Cosmos backfill.
- MachineRepository.QueryByTitleAsync: refactored to use
  Container.GetItemQueryIterator directly with MaxItemCount = 1 (the
  caller breaks on first match). Comment cross-references ADR-0025 § 4
  bridge state — PR 5 replaces the cross-partition query with a
  point-read against machine_title_lookups.
- New CosmosClientWarmupHostedService (BackgroundService) — calls
  CosmosClient.ReadAccountAsync() at host startup to amortize the
  SDK's lazy-connection cost (~300-500ms) off the first user query.
  Failure logs Warning, not throw (warmup is a latency optimization,
  not a hard dependency).
- New CosmosHealthCheck (IHealthCheck) — probes machines container
  via Container.ReadContainerAsync as a canary. Tagged 'live' so
  /healthz reports Cosmos reachability for ACA / Aspire liveness
  probes. On CosmosException, captures Diagnostics into health-check
  data so operators see region + retry count + RU consumed without
  a separate trace lookup.
- Cosmos:ApplicationName per host: pinwiz-cli (CLI) + pinwiz-rag-worker
  (RagIngestionWorker) — distinguishes hosts in Cosmos diagnostics
  + custom-metric tagging without changing client behavior.
- New Microsoft.Extensions.Diagnostics.HealthChecks 10.0.7 package
  in Directory.Packages.props + Infrastructure.csproj. Narrower than
  taking a Microsoft.AspNetCore.App framework reference; gives just
  the IHealthCheck abstractions + AddCheck<T> registration extension
  the Infrastructure layer needs.

Tests (979 -> 987, +8):
- CosmosClientOptionsTests (NEW, 2 tests) — pins all 6 ADR-0025 § 2
  client options via real DI registration (catches wiring drift, not
  just config drift). Includes the null-safe ApplicationName path
  pinning the SDK constraint that empty strings are rejected.
- CosmosHealthCheckTests (NEW, 6 tests) — healthy / CosmosException
  with diagnostics data keys / generic exception / cancellation
  propagation / 2 ctor null guards.
- CosmosRepositoryTests.UpsertAsync_PassesEntityAndPartitionKey_ReturnsInputEntity
  (UPDATED) — asserts Assert.Same(entity, result) AND null ETag
  to prove the new contract returns the input instance, not the
  persisted body.

Pre-push self-audit:
- /local-review qualitative: ✅ 0 🔴 / 0 ⚠️ / 7 categories ✅
- 8-item mechanical (now includes Cosmos surface conformance from PR 1):
  ✅ all 8 items pass
- Build: 0/0 with warnings treated as errors
- Identity: personal noreply

Per ADR-0025 § Architectural style this PR applies "Cosmos document
store + targeted CQRS materialized views; NOT full event sourcing"
to the client-options layer specifically — the materialized-view
container (machine_title_lookups) lands in PR 5.
jkeeley2073 added the claude-code label (Generated with Claude Code) on May 9, 2026
jkeeley2073 enabled auto-merge on May 9, 2026 at 13:26
Comment on lines +58 to +65
    catch (Exception ex)
    {
        stopwatch.Stop();
        _logger.LogWarning(
            ex,
            "Cosmos client warmup failed after {DurationMs:F0}ms. The first user query will pay the lazy-connection cost. Health check (`/healthz`) will surface persistent unreachability.",
            stopwatch.Elapsed.TotalMilliseconds);
    }
Comment on lines +80 to +83
    catch (Exception ex)
    {
        return HealthCheckResult.Unhealthy($"Cosmos unreachable: {ex.Message}", exception: ex);
    }