Skip to content

fix(infra) custom Cosmos data-plane role permits database creation#62

Merged
jkeeley2073 merged 1 commit into
mainfrom
Dev-CosmosContainerBootstrapFixes
May 4, 2026
Merged

fix(infra) custom Cosmos data-plane role permits database creation#62
jkeeley2073 merged 1 commit into
mainfrom
Dev-CosmosContainerBootstrapFixes

Conversation

@jkeeley2073
Copy link
Copy Markdown
Contributor

Summary

Fixes the bug surfaced when running dotnet run --project src/PinballWizard.Cli -- --ensure-cosmos-containers against the live Phase 1 deploy pinwiz-shared-dev-20260503122213 for the first time. The smoke-test returned Forbidden (403); SubStatus 5300 at POST /dbs (database creation).

Root cause

The built-in Cosmos DB Built-in Data Contributor role assigned in PR #60 has these dataActions:

Microsoft.DocumentDB/databaseAccounts/readMetadata
Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/*       # actions on EXISTING containers
Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/items/*  # actions on items

Notably missing: Microsoft.DocumentDB/databaseAccounts/sqlDatabases/write. The data-plane SDK call CreateDatabaseIfNotExistsAsync requires that action to create databases.

This is a known Cosmos RBAC gap — the built-in Contributor role is documented as "operations on existing data and entities; doesn't allow creating, deleting, or modifying entities like databases or containers."

Diagnostic chain

  1. Verified the az-authenticated principal Object ID matches the developerObjectId in infra/main-shared.dev.local.bicepparam (fb4fdb3e-…). MATCH.
  2. Verified the role assignment landed in Cosmos via az cosmosdb sql role assignment list. PRESENT.
  3. Verified it isn't a credential-source ordering issue by forcing AZURE_TOKEN_CREDENTIALS=azureclicredential. STILL 403.
  4. Read the role definition's dataActions directly via az cosmosdb sql role definition show ... --id 0…02. CONFIRMED sqlDatabases/write is missing.
  5. Verified subscription Owner inheritance still grants control-plane access — az cosmosdb sql database create succeeded from the same shell (created pinwiz database manually for the immediate workaround).

Fix

Replace the built-in role assignment in infra/modules/shared.bicep with a custom sqlRoleDefinitions resource named PinWiz Developer Data Contributor whose dataActions are:

  • Microsoft.DocumentDB/databaseAccounts/readMetadata
  • Microsoft.DocumentDB/databaseAccounts/sqlDatabases/* — covers database create/replace/delete + container CRUD + item CRUD + query + change feed in one declaration

The role assignment now references cosmosDeveloperRole.id instead of the built-in role ID. After re-deploy, --ensure-cosmos-containers will succeed end-to-end against deployed Cosmos.

Why custom role over Cosmos DB Operator (rejected alternative)

The ARM Cosmos DB Operator built-in role would also grant database/container management, but:

  • Its scope extends to operations like account delete (broader than the developer needs)
  • It splits the auth model across two ARM planes (data + control) — the SDK uses data-plane endpoints regardless
  • Custom data-plane role is the documented Microsoft pattern for this exact scenario

Re-deploy + verify

# Re-deploy applies the new role definition + assignment
pwsh ./infra/scripts/Deploy-SharedResources.ps1 -Environment dev

# Then re-run the smoke-test that originally 403'd
$cosmosAccount = az cosmosdb list -g rg-pinwiz-shared-dev --query "[0].name" -o tsv
$env:Cosmos__AccountEndpoint = az cosmosdb show -n $cosmosAccount -g rg-pinwiz-shared-dev --query documentEndpoint -o tsv
dotnet run --project src/PinballWizard.Cli -- --ensure-cosmos-containers
# Expected: "Cosmos database + containers ensured." (idempotent — the database
# and containers I created manually during diagnostics will be confirmed-existing
# rather than re-created; partition-key paths asserted to match)

Test Plan

  • dotnet build PinballWizard.slnx -> 0 warnings, 0 errors
  • dotnet test PinballWizard.slnx -> 503 / 503 passing (unchanged from main; this PR adds no .NET code)
  • bicep build infra/main-shared.bicep clean (CI gate)
  • Live verification deferred to the re-deploy step above — the user runs it after merge

Out of Scope

  • Manual cleanup of the orphaned built-in role assignment. Bicep manages role assignments by name (a guid() derived from the roleDefinitionId), so changing the role ID produces a different assignment name. The new assignment is created; the old built-in assignment from PR feat(infra) grant Cosmos data-plane RBAC to developer principal in Bicep #60 is left orphaned. It's a strict subset of the new custom role and is harmless. Manual cleanup if desired:
    $cosmos = az cosmosdb list -g rg-pinwiz-shared-dev --query "[0].name" -o tsv
    az cosmosdb sql role assignment list --account-name $cosmos --resource-group rg-pinwiz-shared-dev --query "[?contains(roleDefinitionId, '00000000-0000-0000-0000-000000000002')].id" -o tsv | ForEach-Object { az cosmosdb sql role assignment delete --account-name $cosmos --resource-group rg-pinwiz-shared-dev --role-assignment-id $_ --yes }
  • Tightening the sqlDatabases/* wildcard to specific actions (write/read/delete + containers/* + containers/items/*). Reviewer flagged as minor; deferred for personal dev environment. The wildcard adds throughputSettings/* (RU/s changes on shared-throughput databases) which the developer doesn't currently need but doesn't actively harm.
  • Bicep what-if golden-file test. Reviewer flagged that a snapshot test would have caught the missing sqlDatabases/write action offline. Deferred — would be a separate test-infra PR.
  • appsettings.json Cosmos:Containers section. I started to add this thinking it was needed, then reverted after re-reading CosmosOptions.cs:64-68 — the property defaults already declare machines (PK /manufacturer) and ingestion_sources (PK /partitionKey). Adding redundant config to appsettings would have created a new code-vs-config drift surface.

Checklist

  • CI is green (build + test + coverage + CodeQL + sanitization + bicep)
  • PR title follows the Conventional Commits format above
  • If this is a new architectural decision, an ADR has been added under docs/adr/ — N/A (this is a fix to PR feat(infra) grant Cosmos data-plane RBAC to developer principal in Bicep #60's RBAC implementation, not a new architectural choice; the choice of "data-plane RBAC over ARM Operator" is already locked by PR feat(infra) grant Cosmos data-plane RBAC to developer principal in Bicep #60's spirit)
  • If user-visible behavior changes, README.md and/or docs/ are updated in the same PR — N/A (the playbook in the session-handoff memory already references --ensure-cosmos-containers; the change here makes that documentation accurate)
  • If a memory in ~/.claude/projects/c--projects-PinballWizard/memory/ is now stale, it has been updated or removed in the same PR — handoff memory updated separately
  • No TODO / FIXME / commented-out code committed
  • No new entries in <NoWarn> without a comment explaining why and the removal criterion

Pre-push self-audit

Step 0 — /local-review (qualitative)

  • Ran /local-review and addressed every critical finding before push
  • Local review outcome: 0 critical / 2 minor (both deferred-with-justification) / 8 categories clean
    • Deferred: sqlDatabases/* wildcard is broader than strictly needed (adds throughputSettings/* privilege the developer doesn't use). Acceptable for single-developer personal dev environment; tighten on next visit if desired.
    • Deferred: missing Bicep what-if golden-file test. Would have caught the missing sqlDatabases/write action offline. Future test-infra PR.

Step 1 — Mechanical checklist

  • Every new *Options property has at least one real getter call in src/ — N/A (no .NET code changed)
  • Sibling-diffed against the closest existing implementation — the new custom-role pattern is intentionally distinct from the 5 other built-in role assignments in shared.bicep; reviewer confirmed this is correct because Cosmos data-plane uses a separate ARM namespace and no built-in Cosmos data role fits
  • No bare catch { } — N/A (no .NET code changed)
  • New ISourceScraper? — N/A
  • Tests assert behavior, not just structure — no new tests; deferred per reviewer (Bicep-only change, CI bicep-build covers schema validation)
  • Build is zero-warning
  • git log -1 --format='%an <%ae>' shows personal noreply, not work email

The built-in 'Cosmos DB Built-in Data Contributor' role granted in
PR #60 has dataActions:
  - readMetadata
  - sqlDatabases/containers/*
  - sqlDatabases/containers/items/*

Notably MISSING: sqlDatabases/write — meaning the role cannot create
databases. The first end-to-end run of '--ensure-cosmos-containers'
against the live Phase 1 deploy (pinwiz-shared-dev-20260503122213)
hit Forbidden (403); SubStatus 5300 at POST /dbs for exactly that
reason.

Diagnostic confirmed by reading the role definition's dataActions
directly via 'az cosmosdb sql role definition show'. Verified the
issue isn't credential-source ordering by forcing
AZURE_TOKEN_CREDENTIALS=azureclicredential and seeing the same 403.
Verified control-plane works (subscription Owner inheritance covers
ARM) by 'az cosmosdb sql database create' succeeding from the same
shell.

Fix replaces the built-in role assignment with a custom
sqlRoleDefinition resource named 'PinWiz Developer Data Contributor'
whose dataActions are readMetadata plus the account-wide wildcard
Microsoft.DocumentDB/databaseAccounts/sqlDatabases/* — covering
database create/replace/delete + container CRUD + item CRUD + query
+ change feed in one declaration. This is the Microsoft-documented
Cosmos pattern for self-sufficient post-deploy bootstrap. The ARM
'Cosmos DB Operator' alternative was rejected because Operator's
scope extends to operations like account delete (broader than the
developer needs) and splits the auth model across two planes.

Bicep manages role assignments by name via guid() derived from the
roleDefinitionId — since the role ID changes, the new assignment has
a different name from the old one, leaving the original built-in
assignment from PR #60 orphaned. It's a strict subset of the new
custom role and is harmless; can be cleaned manually with
'az cosmosdb sql role assignment delete' if desired.

Re-deploy: pwsh ./infra/scripts/Deploy-SharedResources.ps1 -Environment dev

After the re-deploy, '--ensure-cosmos-containers' succeeds end-to-end
against deployed Cosmos and is once again the canonical post-deploy
smoke-test PR #57 designed it to be.

No .NET code changed; tests still 503/503 passing. Build clean.

Pre-push self-audit: /local-review (0 critical / 2 minor / 8
categories clean — both minor are over-broad wildcard and missing
what-if golden-file, both deferrable for a single-developer personal
dev environment) plus 7-item mechanical checklist (all pass).
@jkeeley2073 jkeeley2073 added the claude-code Generated with Claude Code label May 4, 2026
@jkeeley2073 jkeeley2073 merged commit 963080e into main May 4, 2026
6 checks passed
@jkeeley2073 jkeeley2073 deleted the Dev-CosmosContainerBootstrapFixes branch May 4, 2026 00:24
jkeeley2073 added a commit that referenced this pull request May 4, 2026
…rapperArmSdk

feat(infra) ARM-backed CosmosBootstrapper supersedes broken PR #62 RBAC
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claude-code Generated with Claude Code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant