feat(infra) ARM-backed CosmosBootstrapper supersedes broken PR #62 RBAC #63
Merged
PR #62 attempted to fix the --ensure-cosmos-containers 403 against deployed Cosmos by adding 'Microsoft.DocumentDB/databaseAccounts/sqlDatabases/*' to a custom data-plane role. Azure rejected the deploy with 'the provided data action string does not correspond to any valid SQL data action', confirming that Cosmos's data-plane RBAC genuinely does NOT model schema-mutation actions. PR #62 landed in main as non-functional code; the deploy never actually applied.

This PR re-architects schema bootstrap to route through Azure Resource Manager instead.

New ICosmosProvisioner abstraction with two implementations:
- ArmCosmosProvisioner: uses Azure.ResourceManager.CosmosDB against the management endpoint. Required for AAD-authed clients (deployed Cosmos). Authorization is by Azure RBAC at account scope: in dev, the developer's inherited subscription Owner role covers it; the production runtime needs Cosmos DB Operator at account scope.
- DataPlaneCosmosProvisioner: uses Microsoft.Azure.Cosmos against the data-plane endpoint. Used for the Aspire preview emulator (master-key auth permits data-plane schema CRUD without ARM).

CosmosBootstrapper now delegates to the provisioner. Selection happens in AddCosmosPersistence based on whether Cosmos:AccountResourceId is set. New CosmosOptions.AccountResourceId field sourced from the new Bicep output cosmosAccountResourceId.
Bicep changes:
- Revert PR #62's invalid cosmosDeveloperRole resource
- Restore PR #60's well-known Cosmos DB Built-in Data Contributor assignment (correct for runtime data-plane operations)
- Expose cosmosAccountResourceId as a new output (modules/shared.bicep + main-shared.bicep passthrough)

Both provisioners use probe-then-create symmetry for idempotency:
- ArmCosmosProvisioner.EnsureDatabaseAndContainersAsync probes the database via ExistsAsync before issuing CreateOrUpdateAsync, mirroring the container path's drift-detection contract
- DataPlaneCosmosProvisioner uses CreateContainerIfNotExistsAsync, then asserts that the partition-key path matches (existing behavior preserved)

Friendly-error guard at DI-resolution time when Cosmos:AccountResourceId is malformed (e.g., the operator pasted the documentEndpoint URL instead of the ARM resource ID): the IsLikelyCosmosAccountResourceId pre-check throws InvalidOperationException naming the right 'az cosmosdb show ... --query id' invocation. ResourceIdentifier's ctor doesn't validate, so the guard prevents a generic FormatException from surfacing later.

README gains a 'Running against deployed Cosmos' subsection documenting the dual env-var dance (Cosmos__AccountEndpoint + Cosmos__AccountResourceId).

Tests: 503 -> 507. CosmosProvisionerSelectionTests pins ARM-vs-data-plane registration based on AccountResourceId presence, the bootstrapper's DI shape, and the friendly-error remediation message.

Pre-push self-audit: /local-review plus 7-item mechanical checklist (all pass).
- 3 critical, all fixed before push: AzureLocation bug from account.Id.Name -> AzureLocation.EastUS2; missing cosmosAccountResourceId Bicep output added; database CreateOrUpdateAsync unconditional -> probe-then-create symmetry
- 3 minor, all addressed: friendlier parse error, AddCosmosPersistence XML doc updated, AccountResourceId shape validation added
Re-deploy:
    pwsh ./infra/scripts/Deploy-SharedResources.ps1 -Environment dev

Then:
    $env:Cosmos__AccountEndpoint = az cosmosdb show -n pinwiz-cosmos-dev-hlpz4 -g rg-pinwiz-shared-dev --query documentEndpoint -o tsv
    $env:Cosmos__AccountResourceId = az cosmosdb show -n pinwiz-cosmos-dev-hlpz4 -g rg-pinwiz-shared-dev --query id -o tsv
    dotnet run --project src/PinballWizard.Cli -- --ensure-cosmos-containers
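The friendly-error guard described in the commit message above can be illustrated with a small, language-neutral sketch. This is Python rather than the PR's C#, and it is not the project's IsLikelyCosmosAccountResourceId implementation: the regular expression, the function names, and the error wording are all assumptions based on the standard ARM resource ID layout for a Cosmos DB account.

```python
import re

# Assumed shape of a Cosmos DB account ARM resource ID (hypothetical pattern,
# not taken from the PR's source code).
_COSMOS_ACCOUNT_ID = re.compile(
    r"^/subscriptions/[^/]+/resourceGroups/[^/]+"
    r"/providers/Microsoft\.DocumentDB/databaseAccounts/[^/]+/?$",
    re.IGNORECASE,
)


def is_likely_cosmos_account_resource_id(value):
    """Return True when value has the shape of a Cosmos account ARM resource ID."""
    return bool(value) and _COSMOS_ACCOUNT_ID.match(value.strip()) is not None


def require_cosmos_account_resource_id(value):
    """Fail fast with a remediation hint instead of a generic parse error later."""
    if not is_likely_cosmos_account_resource_id(value):
        # Illustrative wording only; the PR's actual message is not shown here.
        raise ValueError(
            "Cosmos:AccountResourceId does not look like an ARM resource ID. "
            "Expected the output of: "
            "az cosmosdb show -n <account> -g <resource-group> --query id -o tsv"
        )
    return value.strip()
```

A pasted documentEndpoint URL fails the shape check because it does not start with /subscriptions/, which is the mistake the guard is meant to catch.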
CodeQL flagged the ARM resource ID as 'clear text storage of sensitive information' on the LogInformation call. ARM resource IDs aren't secrets under the project's threat model (subscription IDs are public identifiers per ADR 0010, committed deliberately in bicepparam files). However, log telemetry routes through App Insights / Log Analytics, which are readable by a wider audience than the source repo, and the CodeQL heuristic is catching the right shape: don't ship more than the operator-facing diagnostic actually needs. The account name is sufficient.
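As a rough illustration of logging only what the diagnostic needs, the account name is the last path segment of the ARM resource ID. The helper below is a hypothetical Python sketch, not the PR's C# code; its name and the "<unknown>" fallback are assumptions.

```python
def account_name_for_log(resource_id):
    """Return only the trailing account-name segment of an ARM resource ID."""
    # Drop empty segments so leading and trailing slashes do not matter.
    segments = [s for s in resource_id.split("/") if s]
    # Hypothetical fallback for empty input; the PR's behavior is not shown.
    return segments[-1] if segments else "<unknown>"
```

A log line built from this value names the account without carrying the subscription ID or resource group.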
Summary
Supersedes PR #62. PR #62 attempted to fix the --ensure-cosmos-containers 403 against deployed Cosmos by defining a custom Cosmos data-plane role with Microsoft.DocumentDB/databaseAccounts/sqlDatabases/* in dataActions. Azure rejected the deploy with "the provided data action string does not correspond to any valid SQL data action", confirming that Cosmos's data-plane RBAC genuinely does NOT model schema-mutation actions, regardless of role definition. PR #62 was therefore non-functional code that landed in main; the deploy never actually applied.
This PR re-architects the approach: schema bootstrap routes through Azure Resource Manager (the management plane) instead of attempting to extend data-plane RBAC.
What's new
- ICosmosProvisioner abstraction with two implementations:
  - ArmCosmosProvisioner: uses Azure.ResourceManager.CosmosDB against the management endpoint. Required for AAD-authed clients (deployed Cosmos). Authorization is by Azure RBAC at account scope: in dev, the developer's inherited subscription Owner role covers it; the production runtime principal will need Cosmos DB Operator (230815da-be43-4aae-9cb4-875f7bd000aa) at account scope.
  - DataPlaneCosmosProvisioner: uses the existing Microsoft.Azure.Cosmos SDK. Used for the Aspire preview emulator path (master-key auth permits data-plane schema CRUD without ARM).
- CosmosBootstrapper now delegates to the provisioner. Selection happens in AddCosmosPersistence based on whether Cosmos:AccountResourceId is configured.
New configuration: CosmosOptions.AccountResourceId (string?) sourced from the new Bicep output cosmosAccountResourceId.
Bicep changes:
- Revert PR #62's invalid cosmosDeveloperRole resource.
- Restore PR #60's well-known Cosmos DB Built-in Data Contributor (00000000-0000-0000-0000-000000000002) assignment: correct for runtime data-plane operations (item CRUD, query, change feed, which MachineRepository / IngestionSourceRepository / OpdbSyncService actually exercise).
- Expose cosmosAccountResourceId as a new output (in modules/shared.bicep + passthrough in main-shared.bicep) so operators copy the value into $env:Cosmos__AccountResourceId after deploy.
Idempotency: Both provisioners use probe-then-create symmetry. ArmCosmosProvisioner.EnsureDatabaseAndContainersAsync probes the database via ExistsAsync before issuing CreateOrUpdateAsync, mirroring the container path's drift-detection contract.
Friendly-error guard: IsLikelyCosmosAccountResourceId pre-check at DI-resolution time throws InvalidOperationException with a remediation message naming the right az cosmosdb show ... --query id invocation when an operator pastes the wrong value (e.g., the documentEndpoint URL instead of the ARM resource ID). ResourceIdentifier's ctor doesn't validate input strings, so the guard prevents a generic FormatException from surfacing later.
Test Plan
- dotnet build PinballWizard.slnx -> 0 warnings, 0 errors
- dotnet test PinballWizard.slnx -> 507 / 507 passing (was 503; CosmosProvisionerSelectionTests adds 4 tests covering ARM-vs-data-plane provisioner registration, the bootstrapper's DI shape, and the friendly-error remediation message)
- bicep build clean (CI gate)
- pwsh ./infra/scripts/Deploy-SharedResources.ps1 -Environment dev (rolls back PR #62's broken role definition; re-applies PR #60's built-in Data Contributor assignment, which never actually changed in Azure since the PR #62 deploy failed mid-flight). Then set Cosmos__AccountEndpoint and Cosmos__AccountResourceId from az cosmosdb show and run the CLI with --ensure-cosmos-containers.
  Expected: "Database 'pinwiz' already present via ARM." plus "Container '<name>' already present via ARM (partition key /<path>)." for each container, i.e. idempotent against the entities created manually during PR #62's diagnostic session.
Out of Scope
- ArmClient integration tests. Reviewer flagged that the probe-then-create / drift-detection paths in ArmCosmosProvisioner aren't exercised by any test; only the DI-time selection is. Deferred: Azure.ResourceManager.CosmosDB's test fixture story is non-trivial (requires MockableArmResource / MockResponse plumbing), and the live re-deploy + smoke-test gives the canonical signal.
- Log wording differs between provisioners: ARM logs "already present via ARM" / "created via ARM" while the data-plane logs "ready via data-plane SDK" (single message regardless of created vs existed). Cosmetic; defer.
Checklist
- ADR under docs/adr/: N/A (this is the architecturally correct shape PR #62 should have had; the underlying decision "Cosmos schema CRUD goes through ARM" is forced by Azure's data-plane RBAC, not chosen)
- README.md and/or docs/ are updated in the same PR: README gains a "Running against deployed Cosmos" subsection
- If anything under ~/.claude/projects/c--projects-PinballWizard/memory/ is now stale, it has been updated or removed in the same PR: handoff memory will be updated post-merge once the smoke-test is observed to succeed
- No TODO / FIXME / commented-out code committed
- No <NoWarn> without a comment explaining why and the removal criterion
Pre-push self-audit
Step 0: /local-review (qualitative)
- Ran /local-review and addressed every critical finding before push
- Critical: AzureLocation(account.Id.Name) constructed a region from a resource name. Replaced with AzureLocation.EastUS2 (matches CosmosOptions.PreferredRegions default and the deployment region) at both database and container call sites.
- Critical: missing Bicep output cosmosAccountResourceId. Added the output to modules/shared.bicep + passthrough in main-shared.bicep.
- Critical: database CreateOrUpdateAsync was unconditional (no probe-then-create like the container path). Added ExistsAsync probe with explicit "created" / "already present" log differentiation.
- Minor: friendlier parse error when Cosmos:AccountResourceId is malformed (IsLikelyCosmosAccountResourceId pre-check + remediation message naming the right az query).
- Minor: AddCosmosPersistence XML doc updated to mention ICosmosProvisioner registration.
- Minor: ResourceIdentifier ctor not validating input, addressed by the pre-check.
Step 1: Mechanical checklist
- Each new *Options property has at least one real getter call in src/: CosmosOptions.AccountResourceId is read by AddCosmosPersistence (provisioner selection) AND consumed by the ArmCosmosProvisioner ctor
- ArmCosmosProvisioner matches DataPlaneCosmosProvisioner for ctor null-checks, idempotent shape, drift-detection error message string, and probe-then-create symmetry post-fix
- No bare catch { }: only a scoped catch (RequestFailedException ex) when (ex.Status == 404) for the container-missing case
- ISourceScraper? N/A
- CosmosProvisionerSelectionTests pins type-of-resolved-provisioner (selection contract), bootstrapper DI resolution, and the load-bearing remediation-message contents
- git log -1 --format='%an <%ae>' shows personal noreply, not work email
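The probe-then-create and drift-detection contract that the self-audit refers to can be sketched against an in-memory stand-in. This is a hedged Python illustration, not the PR's C# provisioners: FakeStore, the function name, the log wording, and the container name and partition-key path in the usage are all hypothetical, and no Azure SDK call is modeled.

```python
class FakeStore:
    """In-memory stand-in for a Cosmos account (hypothetical, not an SDK type)."""

    def __init__(self):
        # database name -> {container name: partition-key path}
        self.databases = {}


def ensure_database_and_containers(store, database, containers, log):
    """Probe first, create only what is missing, and fail on partition-key drift."""
    if database in store.databases:
        log(f"Database '{database}' already present.")
    else:
        store.databases[database] = {}
        log(f"Database '{database}' created.")

    existing = store.databases[database]
    for name, pk_path in containers.items():
        if name not in existing:
            existing[name] = pk_path
            log(f"Container '{name}' created (partition key {pk_path}).")
        elif existing[name] != pk_path:
            # Drift: the container exists with a different partition key.
            raise RuntimeError(
                f"Container '{name}' exists with partition key {existing[name]}, "
                f"expected {pk_path}."
            )
        else:
            log(f"Container '{name}' already present (partition key {pk_path}).")


# Usage: the second run is a no-op apart from "already present" log lines.
store, lines = FakeStore(), []
ensure_database_and_containers(store, "pinwiz", {"machines": "/id"}, lines.append)
ensure_database_and_containers(store, "pinwiz", {"machines": "/id"}, lines.append)
```

Running the function twice with the same inputs leaves the store unchanged on the second pass, which is the idempotency property the re-deploy smoke test checks for in the real provisioners.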