Skip to content

feat(infra) ARM-backed CosmosBootstrapper supersedes broken PR #62 RBAC#63

Merged
jkeeley2073 merged 2 commits into
mainfrom
Dev-CosmosBootstrapperArmSdk
May 4, 2026
Merged

feat(infra) ARM-backed CosmosBootstrapper supersedes broken PR #62 RBAC#63
jkeeley2073 merged 2 commits into
mainfrom
Dev-CosmosBootstrapperArmSdk

Conversation

@jkeeley2073
Copy link
Copy Markdown
Contributor

Summary

Supersedes PR #62. PR #62 attempted to fix the --ensure-cosmos-containers 403 against deployed Cosmos by defining a custom Cosmos data-plane role with Microsoft.DocumentDB/databaseAccounts/sqlDatabases/* in dataActions. Azure rejected the deploy with "the provided data action string does not correspond to any valid SQL data action" — confirming Cosmos's data-plane RBAC genuinely does NOT model schema-mutation actions, regardless of role definition. PR #62 was therefore non-functional code that landed in main; the deploy never actually applied.

This PR re-architects: schema bootstrap routes through Azure Resource Manager (the management plane) instead of attempting to extend data-plane RBAC.

What's new

ICosmosProvisioner abstraction with two implementations:

  • ArmCosmosProvisioner — uses Azure.ResourceManager.CosmosDB against the management endpoint. Required for AAD-authed clients (deployed Cosmos). Auth check is by Azure RBAC at account scope: dev developer's subscription Owner inheritance covers it; production runtime principal will need Cosmos DB Operator (230815da-be43-4aae-9cb4-875f7bd000aa) at account scope.
  • DataPlaneCosmosProvisioner — uses the existing Microsoft.Azure.Cosmos SDK. Used for the Aspire preview emulator path (master-key auth permits data-plane schema CRUD without ARM).

CosmosBootstrapper now delegates to the provisioner. Selection in AddCosmosPersistence based on whether Cosmos:AccountResourceId is configured.

New configuration: CosmosOptions.AccountResourceId (string?) sourced from the new Bicep output cosmosAccountResourceId.

Bicep changes:

Idempotency: Both provisioners use probe-then-create symmetry. ArmCosmosProvisioner.EnsureDatabaseAndContainersAsync probes the database via ExistsAsync before issuing CreateOrUpdateAsync, mirroring the container path's drift-detection contract.

Friendly-error guard: IsLikelyCosmosAccountResourceId pre-check at DI-resolution time throws InvalidOperationException with a remediation message naming the right az cosmosdb show ... --query id invocation when an operator pastes the wrong value (e.g., documentEndpoint URL instead of the ARM resource ID). ResourceIdentifier's ctor doesn't validate input strings.

Test Plan

$env:Cosmos__AccountEndpoint   = az cosmosdb show -n pinwiz-cosmos-dev-hlpz4 -g rg-pinwiz-shared-dev --query documentEndpoint -o tsv
$env:Cosmos__AccountResourceId = az cosmosdb show -n pinwiz-cosmos-dev-hlpz4 -g rg-pinwiz-shared-dev --query id              -o tsv
dotnet run --project src/PinballWizard.Cli -- --ensure-cosmos-containers

Expected: "Database 'pinwiz' already present via ARM." plus "Container '<name>' already present via ARM (partition key /<path>)." for each container — idempotent against the entities created manually during PR #62's diagnostic session.

Out of Scope

  • Mocked-ArmClient integration tests. Reviewer flagged that the probe-then-create / drift-detection paths in ArmCosmosProvisioner aren't exercised by any test; only the DI-time selection is. Deferred — Azure.ResourceManager.CosmosDB's test fixture story is non-trivial (requires MockableArmResource/MockResponse plumbing), and the live re-deploy + smoke-test gives the canonical signal.
  • Logging-message parity. Reviewer noted ARM logs "already present via ARM" / "created via ARM" while the data-plane logs "ready via data-plane SDK" (single message regardless of created vs existed). Cosmetic; defer.

Checklist

  • CI is green (build + test + coverage + CodeQL + sanitization + bicep)
  • PR title follows the Conventional Commits format above
  • If this is a new architectural decision, an ADR has been added under docs/adr/ — N/A (this is the architecturally correct shape PR fix(infra) custom Cosmos data-plane role permits database creation #62 should have had; the underlying decision "Cosmos schema CRUD goes through ARM" is forced by Azure's data-plane RBAC, not chosen)
  • If user-visible behavior changes, README.md and/or docs/ are updated in the same PR — README gains a "Running against deployed Cosmos" subsection
  • If a memory in ~/.claude/projects/c--projects-PinballWizard/memory/ is now stale, it has been updated or removed in the same PR — handoff memory will be updated post-merge once the smoke-test is observed to succeed
  • No TODO / FIXME / commented-out code committed
  • No new entries in <NoWarn> without a comment explaining why and the removal criterion

Pre-push self-audit

Step 0 — /local-review (qualitative)

  • Ran /local-review and addressed every critical finding before push
  • Local review outcome: 3 critical (all fixed) / 3 minor (all fixed) / 4 categories clean
    • Critical fixed: AzureLocation(account.Id.Name) constructed a region from a resource name. Replaced with AzureLocation.EastUS2 (matches CosmosOptions.PreferredRegions default and the deployment region) at both database and container call sites.
    • Critical fixed: Bicep didn't actually emit cosmosAccountResourceId. Added the output to modules/shared.bicep + passthrough in main-shared.bicep.
    • Critical fixed: Database CreateOrUpdateAsync was unconditional (no probe-then-create like the container path). Added ExistsAsync probe with explicit "created" / "already present" log differentiation.
    • Minor fixed: friendlier error when Cosmos:AccountResourceId is malformed (IsLikelyCosmosAccountResourceId pre-check + remediation message naming the right az query).
    • Minor fixed: AddCosmosPersistence XML doc updated to mention ICosmosProvisioner registration.
    • Minor fixed: bug-class concern about ResourceIdentifier ctor not validating input — addressed by the pre-check.

Step 1 — Mechanical checklist

  • Every new *Options property has at least one real getter call in src/CosmosOptions.AccountResourceId is read by AddCosmosPersistence (provisioner selection) AND consumed by ArmCosmosProvisioner ctor
  • Sibling-diffed against the closest existing implementation — ArmCosmosProvisioner matches DataPlaneCosmosProvisioner for ctor null-checks, idempotent shape, drift-detection error message string, and probe-then-create symmetry post-fix
  • No bare catch { } — only scoped catch (RequestFailedException ex) when (ex.Status == 404) for the container-missing case
  • New ISourceScraper? — N/A
  • Tests assert behavior, not just structure — CosmosProvisionerSelectionTests pins type-of-resolved-provisioner (selection contract), bootstrapper DI resolution, and the load-bearing remediation-message contents
  • Build is zero-warning
  • git log -1 --format='%an <%ae>' shows personal noreply, not work email

PR #62 attempted to fix --ensure-cosmos-containers 403 against deployed
Cosmos by adding 'Microsoft.DocumentDB/databaseAccounts/sqlDatabases/*'
to a custom data-plane role. Azure rejected the deploy with 'the
provided data action string does not correspond to any valid SQL data
action' — confirming Cosmos's data-plane RBAC genuinely does NOT model
schema-mutation actions. PR #62 landed in main as non-functional code;
the deploy never actually applied.

This PR re-architects schema bootstrap to route through Azure Resource
Manager instead. New ICosmosProvisioner abstraction with two
implementations:

- ArmCosmosProvisioner — uses Azure.ResourceManager.CosmosDB against
  the management endpoint. Required for AAD-authed clients (deployed
  Cosmos). Auth is by Azure RBAC at account scope: dev developer's
  subscription Owner inheritance covers it; production runtime needs
  Cosmos DB Operator at account scope.
- DataPlaneCosmosProvisioner — uses Microsoft.Azure.Cosmos against the
  data-plane endpoint. Used for the Aspire preview emulator (master-
  key auth permits data-plane schema CRUD without ARM).

CosmosBootstrapper now delegates to the provisioner. Selection happens
in AddCosmosPersistence based on whether Cosmos:AccountResourceId is
set. New CosmosOptions.AccountResourceId field sourced from the new
Bicep output cosmosAccountResourceId.

Bicep changes:
- Revert PR #62's invalid cosmosDeveloperRole resource
- Restore PR #60's well-known Cosmos DB Built-in Data Contributor
  assignment (correct for runtime data-plane operations)
- Expose cosmosAccountResourceId as a new output (modules/shared.bicep
  + main-shared.bicep passthrough)

Both provisioners use probe-then-create symmetry for idempotency:
- ArmCosmosProvisioner.EnsureDatabaseAndContainersAsync probes the
  database via ExistsAsync before issuing CreateOrUpdateAsync, mirroring
  the container path's drift-detection contract
- DataPlaneCosmosProvisioner uses CreateContainerIfNotExistsAsync then
  asserts partition-key path matches (existing behavior preserved)

Friendly-error guard at DI-resolution time when Cosmos:AccountResourceId
is malformed (e.g., operator pasted documentEndpoint URL instead of
the ARM resource ID): IsLikelyCosmosAccountResourceId pre-check throws
InvalidOperationException naming the right 'az cosmosdb show ... --query
id' invocation. ResourceIdentifier's ctor doesn't validate, so the
guard prevents a generic FormatException from surfacing later.

README gains a 'Running against deployed Cosmos' subsection documenting
the dual env-var dance (Cosmos__AccountEndpoint + Cosmos__AccountResourceId).

Tests: 503 -> 507. CosmosProvisionerSelectionTests pins ARM-vs-data-
plane registration based on AccountResourceId presence, the
bootstrapper's DI shape, and the friendly-error remediation message.

Pre-push self-audit: /local-review (3 critical, all fixed before push:
AzureLocation bug from account.Id.Name -> AzureLocation.EastUS2;
missing cosmosAccountResourceId Bicep output added; database
CreateOrUpdateAsync unconditional -> probe-then-create symmetry; plus
3 minor, all addressed: friendlier parse error, AddCosmosPersistence
XML doc updated, AccountResourceId shape validation added) plus 7-item
mechanical checklist (all pass).

Re-deploy: pwsh ./infra/scripts/Deploy-SharedResources.ps1 -Environment dev

Then:
  $env:Cosmos__AccountEndpoint   = az cosmosdb show -n pinwiz-cosmos-dev-hlpz4 -g rg-pinwiz-shared-dev --query documentEndpoint -o tsv
  $env:Cosmos__AccountResourceId = az cosmosdb show -n pinwiz-cosmos-dev-hlpz4 -g rg-pinwiz-shared-dev --query id              -o tsv
  dotnet run --project src/PinballWizard.Cli -- --ensure-cosmos-containers
@jkeeley2073 jkeeley2073 added the claude-code Generated with Claude Code label May 4, 2026
CodeQL flagged the ARM resource ID as 'clear text storage of sensitive
information' on the LogInformation call. ARM resource IDs aren't
secrets per the project's threat model (subscription IDs are public
identifiers per ADR 0010, committed deliberately in bicepparam files),
but log telemetry routes through App Insights / Log Analytics which
have broader access than the source repo, and the CodeQL heuristic is
catching the right shape: don't ship more than the operator-facing
diagnostic actually needs. Account name is sufficient.
@jkeeley2073 jkeeley2073 merged commit 8a19a48 into main May 4, 2026
6 checks passed
@jkeeley2073 jkeeley2073 deleted the Dev-CosmosBootstrapperArmSdk branch May 4, 2026 11:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claude-code Generated with Claude Code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants