
perf(azure): batched SKU catalogue lookup eliminates N+1 in recommendation converters#81

Merged
cristim merged 1 commit into feat/multicloud-web-frontend from perf/azure-batched-sku-lookup on Apr 27, 2026

Conversation

@cristim (Member) commented Apr 25, 2026

Summary

Adds a lazy, single-shot SKU catalogue lookup to three of the four Azure recommendation converters (cache, cosmos, database) so the previously-deferred Details fields populate without N+1 SDK calls.

  • Cache: CacheDetails.Shards from armredis.Properties.ShardCount.
  • Cosmos: NoSQLDetails.APIType (sql / mongodb / cassandra / gremlin / table) from the dominant Cosmos account Kind + Capabilities in the subscription.
  • Database: DatabaseDetails.EngineVersion from armsql.CapabilitiesClient.ListByLocation traversal.

Each converter calls a per-service cachedSKULookup (or cachedAPIType on cosmos). The catalogue is fetched ONCE per client lifetime via sync.Once; many converter calls in the same GetRecommendations run hit the in-memory map. A failed catalogue fetch leaves the cache nil and converters fall back to the previous empty-Details behaviour with a one-time WARN log — the conversion itself never fails.

Compute scoped back: common.ComputeDetails (in pkg/common/types.go) currently exposes only InstanceType / Platform / Tenancy / Scope. The two enrichments the catalogue would supply for compute (vCPU, MemoryGB) have no struct fields to write into. Adding fields to a shared pkg/common type would touch all 3 cloud providers, the frontend, and the matchers — out of scope for this perf change. Tracked as a narrower follow-up in known_issues/10_azure_provider.md.

Test plan

New per-service tests in the existing *_test.go:

  • _PopulatesShardsFromSKUCache / _PopulatesAPIType (5 sub-cases) / _PopulatesEngineVersion — cache hit populates the previously-deferred field.
  • _PagerErrorFallsBack / _CapabilitiesErrorFallsBack — fetch error leaves the field empty and conversion still succeeds (graceful-degradation contract).
  • _CachedSKULookup_FetchedOnce / _CachedAPIType_FetchedOnce — many converter calls share a single catalogue fetch (the N+1 invariant pinned in tests).
  • _AmbiguousAPIType — multi-API-type Cosmos subscription leaves APIType empty rather than guessing.
  • Pre-existing _PopulatesAllFields tests now inject empty mock pagers so they don't hit the real Azure API on first lookup.
  • cd providers/azure && go test ./... — all green.
  • go test ./... (root) — all green.
  • go vet ./... — clean.
  • Pre-commit hooks (gofmt, gocyclo, gosec, etc.) — all green.

Closes #49 for cache / cosmos / database. Compute follow-up filed as a narrower issue in the known-issues doc.

…ation converters

The four Azure recommendation converters (compute, database, cache,
cosmosdb) previously left SKU-derived `Details` fields empty because
populating them inline would have triggered an N+1 SDK call per
recommendation. This change wires three of them to a lazily-cached,
single-shot SKU catalogue lookup, gated by `sync.Once` and held for the
client's lifetime.

Per-service:

- **Cache** (`services/cache`): caches `Properties.ShardCount` per SKU
  from `armredis.Client.NewListBySubscriptionPager`. Converter populates
  `CacheDetails.Shards` for Premium-tier clustered caches.
- **Cosmos** (`services/cosmosdb`): caches the dominant Cosmos APIType
  (mongodb / cassandra / gremlin / table / sql) across the
  subscription's accounts via `armcosmos.DatabaseAccountsClient.NewListPager`,
  mapping `account.Kind` + `Capabilities`. Converter populates
  `NoSQLDetails.APIType`. Multi-API-type subscriptions leave APIType
  empty rather than guessing.
- **Database** (`services/database`): caches engine version per SKU
  from `armsql.CapabilitiesClient.ListByLocation`, traversing
  ServerVersion → Editions → ServiceLevelObjectives. Converter
  populates `DatabaseDetails.EngineVersion`.

Each converter calls a per-service `cachedSKULookup` (or `cachedAPIType`
on cosmos). The catalogue is fetched ONCE per client lifetime; many
converter calls in the same `GetRecommendations` run hit the in-memory
map. A failed catalogue fetch leaves the cache nil and converters fall
back to the previous empty-Details behaviour with a one-time WARN log —
the conversion itself never fails, preserving the
graceful-degradation contract.

**Compute scoped back**: `common.ComputeDetails` (in `pkg/common/types.go`)
exposes only InstanceType/Platform/Tenancy/Scope. The two enrichments
the SKU catalogue would supply for compute are vCPU and MemoryGB —
neither has a struct field to write into. Adding fields to the shared
type would touch all 3 cloud providers' converters, the frontend, and
the matchers, well outside this perf change's scope. Documented as a
narrower follow-up in `known_issues/10_azure_provider.md`.

Tests per service in the existing `*_test.go`:

- `_PopulatesShardsFromSKUCache` / `_PopulatesAPIType` /
  `_PopulatesEngineVersion` — cache hit populates the previously-deferred
  field.
- `_PagerErrorFallsBack` / `_CapabilitiesErrorFallsBack` — fetch error
  leaves the field empty and conversion still succeeds.
- `_FetchedOnce` — many converter calls share a single catalogue fetch
  (the N+1 invariant).
- The pre-existing `_PopulatesAllFields` tests now inject empty mock
  pagers so they don't hit the real Azure API on first lookup.

Refs #49
@coderabbitai Bot commented Apr 25, 2026

Warning

Rate limit exceeded

@cristim has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 31 minutes and 52 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 31 minutes and 52 seconds.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6d7c96a5-6222-434c-83cc-0581b1bff0ec

📥 Commits

Reviewing files that changed from the base of the PR and between 96dccb8 and b3fe6e2.

📒 Files selected for processing (8)
  • known_issues/10_azure_provider.md
  • providers/azure/services/cache/client.go
  • providers/azure/services/cache/client_test.go
  • providers/azure/services/compute/client.go
  • providers/azure/services/cosmosdb/client.go
  • providers/azure/services/cosmosdb/client_test.go
  • providers/azure/services/database/client.go
  • providers/azure/services/database/client_test.go

@cristim (Member, Author) commented Apr 25, 2026

@coderabbitai review

@coderabbitai Bot commented Apr 25, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cristim (Member, Author) commented Apr 27, 2026

Two follow-up GitHub issues have been filed for the deferrals this PR explicitly scopes back. Both reference back here and propose mirroring the cachedAPIType / cachedSKULookup patterns this PR establishes.

@cristim cristim merged commit 5e3b54d into feat/multicloud-web-frontend Apr 27, 2026
3 checks passed
@cristim cristim added labels on Apr 28, 2026: triaged (Item has been triaged), priority/p3 (Polish / idea / may never ship), severity/low (Minor harm), urgency/this-quarter (Within the quarter), impact/few (Limited audience), effort/l (Weeks), type/feat (New capability)
@cristim cristim deleted the perf/azure-batched-sku-lookup branch April 29, 2026 10:08
cristim added a commit that referenced this pull request Apr 30, 2026
Adds optional VCPU (int) and MemoryGB (float64) fields to
common.ComputeDetails so per-provider catalogue lookups can enrich
compute recommendations with sizing data. Both fields use json
omitempty — converters that don't yet wire a catalogue leave them at
the zero value and the API payload stays clean.

GetDetailDescription appends " (N vCPU / M GB)" when both values are
> 0, otherwise returns the existing "platform/tenancy" form. Format
uses %g so 16 GB renders as "16 GB" (not "16.000000 GB") while 0.5 GB
SKUs render as "0.5 GB".

Scope decision — schema-only foundation:

- Azure: PR #81 (perf/azure-batched-sku-lookup) introduces the
  cachedSKULookup helper for cache/cosmos/database. Compute was
  scoped back from #81 because these fields didn't exist. With the
  schema in place, a follow-up PR can wire compute on top of #81 once
  it merges (filed as a separate issue post-merge to avoid an
  unmerged-PR dependency in this changeset).
- AWS: ce.EC2InstanceDetails (Cost-Explorer) has no vCPU/Memory
  fields; populating would require a new ec2:DescribeInstanceTypes
  caller + cache + mocks. Substantial net-new code, tracked as a
  separate follow-up.
- GCP: no compute recommendation converter exists today. N/A.
- Frontend: api.Recommendation has no `details` field and the detail
  drawer renders a hard-coded list. Surfacing the new fields needs
  separate API + UI work; tracked as follow-up once at least one
  provider populates the fields.

Tests: extend pkg/common/types_test.go::TestComputeDetails_GetDetail
Description with 4 new table rows — VCPU-only, MemoryGB-only,
integer-GB, and fractional-GB (Azure 0.5 GB SKU shape).

Refs #82
cristim added a commit that referenced this pull request May 3, 2026
closes #148) (#229)

PR #97 added VCPU + MemoryGB fields to common.ComputeDetails (with
omitempty JSON tags). PR #81 introduced the lazy SKU-catalogue pattern
for cache/cosmosdb/database but explicitly scoped back compute because
the destination fields didn't exist. Both prerequisites are now in
place; this wires Azure compute to the same pattern.

Implementation mirrors cache/database verbatim for consistency:

- New unexported vmSKUEntry{vCPUs, memoryGB}.
- New skuCacheOnce sync.Once + skuCacheMap field on ComputeClient.
- cachedSKULookup(ctx, skuName) lazily triggers the catalogue fetch on
  first call; subsequent calls are O(1) map lookups. ok=false on miss
  or fetch error.
- fetchSKUCatalogue reuses the existing createResourceSKUsPager and
  isAvailableInRegion helpers (already used by GetValidResourceTypes),
  walks every page, and reduces virtualMachines SKUs into the map.
  Returns nil on error so the sync.Once-gated cache stays nil and
  converters fall back to the empty-fields path with a one-time WARN.
- extractVMSKUCapabilities pulls vCPUs (Atoi) and MemoryGB
  (ParseFloat) from the SKU's Capabilities name/value list. Unparseable
  or missing capabilities → 0, treated as "unknown".
- populateVMSKUMapFromPage was extracted out of fetchSKUCatalogue to
  stay under the project's gocyclo=10 threshold (matches the
  cache/database extraction pattern from PR #81).
- convertAzureVMRecommendation now takes ctx (was _) and populates
  Details.VCPU/MemoryGB from the cache when both >0. The single caller
  GetRecommendations already passes ctx; no public API change.

Behaviour preserved on failure: a transient ResourceSKUsClient error
no longer breaks the conversion — VCPU/MemoryGB stay at 0 (omitempty
hides them from API payloads), the rest of Details is populated from
the recommendation payload as before, and a WARN is logged once per
client lifetime.

UX improvement: common.ComputeDetails.GetDetailDescription appends
" (<vcpu> vCPU / <memory> GB)" when both fields are >0, so Azure VM
recommendation summaries now include the size string instead of the
SKU name alone.

Tests added in providers/azure/services/compute/client_test.go
(file-scoped vmSKUCatalogueMockPager keeps the shared
mocks.MockResourceSKUsPager surface untouched, mirroring the
cosmosdb / cache test convention):

- TestComputeClient_ConvertAzureVMRecommendation_PopulatesVCPUAndMemoryFromSKUCache:
  catalogue hit populates VCPU=2, MemoryGB=8 for Standard_D2s_v3.
- TestComputeClient_ConvertAzureVMRecommendation_PagerErrorFallsBack:
  fetch error leaves both fields at 0; conversion succeeds.
- TestComputeClient_ConvertAzureVMRecommendation_NoMatchLeavesFieldsZero:
  catalogue miss leaves both fields at 0; conversion succeeds.
- TestComputeClient_CachedSKULookup_FetchedOnce: 10 lookups trigger
  exactly 1 NextPage call (pins the N+1 invariant per PR #81).

Out of scope (separate follow-ups, per the issue body):

- AWS / GCP compute converter wiring.
- Platform / Tenancy / Scope enrichment (require non-ResourceSKUs
  Azure data sources).

Verification: gofmt clean; go vet ./... clean; gocyclo -over 10 finds
no offenders in the modified file; go test
github.com/LeanerCloud/CUDly/providers/azure/services/compute → 42
passing (4 new + all existing). The pre-existing
providers/azure/services/search build failure (missing go.sum entry)
is unrelated to this change.

Closes #148