30 changes: 25 additions & 5 deletions known_issues/10_azure_provider.md
@@ -1,6 +1,6 @@
# Known Issues: Azure Provider

> **Audit status (2026-04-21):** `2 follow-ups from CRITICAL rewrite · 12 resolved · 0 partially fixed · 0 moved · 1 new follow-up surfaced (batched SKU enrichment — deferred N+1 avoidance)`
> **Audit status (2026-04-25):** `2 follow-ups from CRITICAL rewrite · 13 resolved · 0 partially fixed · 0 moved · 1 narrower follow-up surfaced (compute Details fields require pkg/common type extension)`

## ~~CRITICAL: Recommendation converters ignore the API response entirely~~ — RESOLVED

@@ -300,13 +300,33 @@ Each converter's godoc now explicitly documents which fields are payload-derivable

### New follow-ups surfaced during this resolution

#### MEDIUM: Azure converters leave SKU-lookup Details fields empty (deferred N+1 avoidance)
#### ~~MEDIUM: Azure converters leave SKU-lookup Details fields empty (deferred N+1 avoidance)~~ — RESOLVED for 3 of 4 services

**File:** `providers/azure/services/{compute,database,cache,cosmosdb}/client.go::convertAzure*Recommendation`
**File:** `providers/azure/services/{database,cache,cosmosdb}/client.go::convertAzure*Recommendation`

**Description:** All four converters populate the subset of `Details` derivable from the reservation recommendation payload only. Fields that require an SDK catalogue lookup — compute Platform/Tenancy/Scope + memory/vCPU, database EngineVersion/AZConfig, cache Shards, cosmos APIType — are deliberately left empty because populating them inline would produce an N+1 API call per recommendation.
**Status:** ✔️ Resolved for cache/cosmos/database; compute scoped back (see follow-up below)

**Proposed fix:** single-shot batched SKU lookup per region per service (`armcompute.ResourceSKUsClient.ListByLocation(region)`, `armsql.SKUsClient.ListByServer`, `armredis.SKUsClient.ListByLocation`, `armcosmos.DatabaseAccountsClient.ListByResourceGroup`), called ONCE at the start of `GetRecommendations` for each region appearing in the response, cached in a `map[skuName]SKUCatalogueEntry`, then consulted as the converter builds each rec's Details. No N+1 storm, full Details population.
**Resolution:** Each of the three database/cache/cosmosdb service clients gained a lazy SKU catalogue cache (`sync.Once`-gated `map[string]<svcSKUEntry>`). The catalogue is fetched ONCE per client lifetime via the existing per-service pager interface — `armredis.Client.NewListBySubscriptionPager` for cache, `armsql.CapabilitiesClient.ListByLocation` for database, `armcosmos.DatabaseAccountsClient.NewListPager` for cosmos. Subsequent converter calls in the same `GetRecommendations` run hit the in-memory map (O(1) lookup), eliminating the N+1 storm. A failed catalogue fetch leaves the cache map nil; converters fall back to the previous empty-Details behaviour with a one-time WARN log — the conversion itself never fails.

Fields now populated from the catalogue:

- **Cache:** `CacheDetails.Shards` from `armredis.Properties.ShardCount` (clustered Premium-tier caches; non-clustered caches keep Shards at zero, which signals "unknown").
- **Database:** `DatabaseDetails.EngineVersion` from the SQL Server version traversed in `armsql.LocationCapabilities.SupportedServerVersions` → `SupportedEditions` → `SupportedServiceLevelObjectives` → `SKU.Name`. AZConfig/Deployment require per-server lookups and remain deferred.
- **Cosmos:** `NoSQLDetails.APIType` (sql / mongodb / cassandra / gremlin / table) resolved by mapping `account.Kind` + `account.Properties.Capabilities` for the single dominant APIType across the subscription's Cosmos accounts. Multi-API-type subscriptions leave APIType empty rather than guessing.

Tests in each service's `client_test.go` cover three new contracts per service:

- `_PopulatesShardsFromSKUCache` / `_PopulatesAPIType` / `_PopulatesEngineVersion` — cache hit populates the previously-deferred field.
- `_PagerErrorFallsBack` / `_CapabilitiesErrorFallsBack` — catalogue-fetch error leaves the field empty + the conversion still succeeds.
- `_FetchedOnce` — many converter calls share a single catalogue fetch (the N+1 invariant).

#### MEDIUM: Azure compute Details vCPU/MemoryGB require ComputeDetails type extension

**File:** `providers/azure/services/compute/client.go::convertAzureVMRecommendation`, `pkg/common/types.go::ComputeDetails`

**Description:** The cache/cosmos/database batched-SKU-catalogue lookup intentionally skipped compute. Reason: `common.ComputeDetails` currently exposes only `InstanceType`, `Platform`, `Tenancy`, `Scope`. The SKU-catalogue enrichments compute would supply (vCPU + MemoryGB from `armcompute.ResourceSKU.Capabilities`) have no struct field to write into. Adding fields to a shared `pkg/common` type touches every cloud provider's converter, the frontend rendering layer, the matchers, and existing snapshot/golden tests across all 3 cloud providers — well outside the scope of a perf change to the Azure converters.

**Proposed fix:** (1) add `VCPU int` + `MemoryGB float64` fields to `common.ComputeDetails`; (2) update AWS + GCP converters to populate them where the SDK already exposes the data; (3) port the same lazy `cachedSKULookup` pattern from cache/cosmos/database into compute and wire the new fields through.

**Status:** ❓ Needs triage

110 changes: 102 additions & 8 deletions providers/azure/services/cache/client.go
@@ -10,6 +10,7 @@ import (
"net/http"
"net/url"
"strings"
"sync"
"time"

"github.com/Azure/azure-sdk-for-go/sdk/azcore"
@@ -19,11 +20,22 @@ import (
"github.com/google/uuid"

"github.com/LeanerCloud/CUDly/pkg/common"
"github.com/LeanerCloud/CUDly/pkg/logging"
"github.com/LeanerCloud/CUDly/providers/azure/internal/httpclient"
"github.com/LeanerCloud/CUDly/providers/azure/internal/pricing"
"github.com/LeanerCloud/CUDly/providers/azure/internal/recommendations"
)

// redisSKUEntry holds the SKU-catalogue-derived fields the converter
// wants for each Redis SKU. Sourced from the cache's Properties:
// - shardCount: Properties.ShardCount (Premium-tier clustered caches).
//
// Non-clustered caches return shardCount=0; the converter treats 0 as
// "unknown" and leaves CacheDetails.Shards at its zero value.
type redisSKUEntry struct {
shardCount int
}

// HTTPClient interface for HTTP operations (enables mocking)
type HTTPClient interface {
Do(req *http.Request) (*http.Response, error)
@@ -56,6 +68,16 @@ type CacheClient struct {
recommendationsPager RecommendationsPager
reservationsPager ReservationsDetailsPager
redisCachesPager RedisCachesPager

// Lazy SKU catalogue cache. armredis exposes no per-region "list all
// possible SKUs" surface, so we derive shard counts from the existing
// caches in the subscription (NewListBySubscriptionPager). Fetched
// ONCE per client lifetime; subsequent converter calls in the same
// GetRecommendations run hit the in-memory map. A failed fetch leaves
// skuCacheMap nil and converters fall back to Shards=0 with a WARN
// log — the conversion itself does NOT fail.
skuCacheOnce sync.Once
skuCacheMap map[string]redisSKUEntry
}

// NewClient creates a new Azure Cache client
@@ -557,14 +579,23 @@ func extractRedisPricing(items []CacheRetailPriceItem, termYears int) (onDemand,
// SDK-to-struct ladder. Returns nil when the SDK payload is unusable.
//
// Details populated by parsing the SKU string into Engine ("redis") and
// NodeType. Shard count requires an armredis SKU catalogue lookup and is
// deferred to a batched-enrichment follow-up tracked in
// known_issues/10_azure_provider.md.
func (c *CacheClient) convertAzureRedisRecommendation(_ context.Context, azureRec armconsumption.ReservationRecommendationClassification) *common.Recommendation {
// NodeType, then enriched from the lazily-cached armredis catalogue
// (cachedSKULookup). Shards is sourced from the catalogue when an
// existing cache in the subscription matches the recommendation's
// Premium-tier SKU; otherwise stays 0 (zero means "unknown", not
// "definitely zero shards" — see the redisSKUEntry godoc).
func (c *CacheClient) convertAzureRedisRecommendation(ctx context.Context, azureRec armconsumption.ReservationRecommendationClassification) *common.Recommendation {
f := recommendations.Extract(azureRec)
if f == nil {
return nil
}
details := common.CacheDetails{
Engine: "redis",
NodeType: f.ResourceType,
}
if entry, ok := c.cachedSKULookup(ctx, f.ResourceType); ok && entry.shardCount > 0 {
details.Shards = entry.shardCount
}
return &common.Recommendation{
Provider: common.ProviderAzure,
Service: common.ServiceCache,
@@ -579,11 +610,74 @@ func (c *CacheClient) convertAzureRedisRecommendation(_ context.Context, azureRe
Term: f.Term,
PaymentOption: "upfront",
Timestamp: time.Now(),
Details: common.CacheDetails{
Engine: "redis",
NodeType: f.ResourceType,
},
Details: details,
}
}

// cachedSKULookup returns the SKU catalogue entry for skuName, fetching
// the catalogue lazily on first call. The catalogue is fetched ONCE per
// client lifetime via armredis.Client.NewListBySubscriptionPager;
// subsequent calls are O(1) map lookups. ok=false on cache miss OR
// catalogue-fetch failure — the caller falls back to Shards=0 rather
// than failing the whole conversion.
//
// Why source from existing caches: armredis exposes no "list all
// possible SKUs" endpoint. The catalogue surface that does exist
// (existing cache instances in the subscription) gives us authoritative
// shard counts for the SKUs the customer actually uses, which is the
// set the recommendation engine recommends from anyway.
func (c *CacheClient) cachedSKULookup(ctx context.Context, skuName string) (redisSKUEntry, bool) {
c.skuCacheOnce.Do(func() {
c.skuCacheMap = c.fetchSKUCatalogue(ctx)
})
if c.skuCacheMap == nil {
return redisSKUEntry{}, false
}
entry, ok := c.skuCacheMap[skuName]
return entry, ok
}

// fetchSKUCatalogue performs the single ListBySubscription walk and
// reduces the response into a name->redisSKUEntry map keyed by the
// "<Tier>_<Family><Capacity>" SKU naming convention (e.g. "Premium_P1")
// that matches the recommendation engine's ResourceType output.
//
// Returns nil on error so the sync.Once-gated cache field stays nil and
// cachedSKULookup falls back to the empty-Details path. The fetch error
// is logged WARN once.
func (c *CacheClient) fetchSKUCatalogue(ctx context.Context) map[string]redisSKUEntry {
pager, err := c.createRedisCachesPager()
if err != nil {
logging.Warnf("azure cache: SKU catalogue pager create failed for region %s: %v — Details.Shards left at 0", c.region, err)
return nil
}
out := make(map[string]redisSKUEntry)
for pager.More() {
page, err := pager.NextPage(ctx)
if err != nil {
logging.Warnf("azure cache: SKU catalogue page fetch failed for region %s: %v — partial cache (%d entries) discarded, Details.Shards left at 0", c.region, err, len(out))
return nil
}
for _, cache := range page.Value {
fullSKU := extractSKUFromCache(cache)
if fullSKU == "" {
continue
}
shards := 0
if cache.Properties != nil && cache.Properties.ShardCount != nil {
shards = int(*cache.Properties.ShardCount)
}
// First-write-wins: if the same SKU appears on two caches
// with different ShardCounts, keep the first. Real Premium
// clusters configured to the same SKU typically share the
// shard count anyway — this just keeps the cache
// deterministic regardless of pager order.
if _, exists := out[fullSKU]; !exists {
out[fullSKU] = redisSKUEntry{shardCount: shards}
}
}
}
return out
}

// applyPurchaseAutomationTag attaches the purchase-automation tag to an Azure
130 changes: 126 additions & 4 deletions providers/azure/services/cache/client_test.go
@@ -740,9 +740,14 @@ func TestCacheClient_ConvertAzureRedisRecommendation_NilGuards(t *testing.T) {

// TestCacheClient_ConvertAzureRedisRecommendation_PopulatesAllFields asserts
// the converter forwards every helper-extracted field + applies the
// Cache-service-specific constants.
// Cache-service-specific constants. An empty SKU catalogue (no caches in
// the subscription) is the "no signal" baseline — Shards stays 0.
func TestCacheClient_ConvertAzureRedisRecommendation_PopulatesAllFields(t *testing.T) {
client := NewClient(nil, "test-subscription", "eastus")
// Inject empty mock pager so the cachedSKULookup hits the lazy-fetch
// codepath without making a real Azure API call.
client.SetRedisCachesPager(&MockRedisCachesPager{})

rec := mocks.BuildLegacyReservationRecommendation(
mocks.WithRegion("eastus"),
mocks.WithTerm("P3Y"),
@@ -765,14 +770,131 @@ func TestCacheClient_ConvertAzureRedisRecommendation_PopulatesAllFields(t *testi
assert.Equal(t, "3yr", out.Term)
assert.Equal(t, "upfront", out.PaymentOption)

// Details carries Engine=redis + NodeType from the SKU string.
// Shard count is deferred to batched enrichment.
// Details carries Engine=redis + NodeType from the SKU string. Shards
// stays 0 when no cache instance matches in the subscription (the
// catalogue's only signal source for shard counts).
require.NotNil(t, out.Details)
details, ok := out.Details.(common.CacheDetails)
require.True(t, ok, "Details must be a common.CacheDetails value")
assert.Equal(t, "redis", details.Engine)
assert.Equal(t, "Premium_P3", details.NodeType)
assert.Equal(t, 0, details.Shards, "Shards is deferred to batched enrichment")
assert.Equal(t, 0, details.Shards, "Shards is 0 when no matching cache exists in the subscription")
}

// TestCacheClient_ConvertAzureRedisRecommendation_PopulatesShardsFromSKUCache
// asserts the new batched-SKU-catalogue lookup populates CacheDetails.Shards
// when the subscription has a Premium-tier clustered cache with the same
// SKU as the recommendation. Single ListBySubscription call per client
// lifetime feeds many converter calls; this test pins the contract.
func TestCacheClient_ConvertAzureRedisRecommendation_PopulatesShardsFromSKUCache(t *testing.T) {
client := NewClient(nil, "test-subscription", "eastus")

premiumName := armredis.SKUNamePremium
familyP := armredis.SKUFamilyP
capacity := int32(3)
shardCount := int32(5)

// One Premium_P3 cache with 5 shards in the subscription. The
// converter will see f.ResourceType="Premium_P3" and look it up.
mockPager := &MockRedisCachesPager{
pages: []armredis.ClientListBySubscriptionResponse{
{
ListResult: armredis.ListResult{
Value: []*armredis.ResourceInfo{
{
Properties: &armredis.Properties{
SKU: &armredis.SKU{
Name: &premiumName,
Family: &familyP,
Capacity: &capacity,
},
ShardCount: &shardCount,
},
},
},
},
},
},
}
client.SetRedisCachesPager(mockPager)

rec := mocks.BuildLegacyReservationRecommendation(
mocks.WithRegion("eastus"),
mocks.WithNormalizedSize("Premium_P3"),
)
out := client.convertAzureRedisRecommendation(context.Background(), rec)
require.NotNil(t, out)
details, ok := out.Details.(common.CacheDetails)
require.True(t, ok)
assert.Equal(t, "redis", details.Engine)
assert.Equal(t, "Premium_P3", details.NodeType)
assert.Equal(t, 5, details.Shards, "Shards must be enriched from the cached SKU catalogue")
}

// TestCacheClient_ConvertAzureRedisRecommendation_PagerErrorFallsBack
// asserts that a SKU-catalogue fetch failure does NOT fail the
// conversion — Shards just stays at 0 and the rest of Details is
// populated from the recommendation payload as before. This is the
// graceful-degradation contract the issue asks for.
func TestCacheClient_ConvertAzureRedisRecommendation_PagerErrorFallsBack(t *testing.T) {
client := NewClient(nil, "test-subscription", "eastus")
mockPager := &MockRedisCachesPager{
pages: []armredis.ClientListBySubscriptionResponse{{}},
err: errors.New("transient Azure API error"),
}
client.SetRedisCachesPager(mockPager)

rec := mocks.BuildLegacyReservationRecommendation(
mocks.WithRegion("eastus"),
mocks.WithNormalizedSize("Premium_P3"),
)
out := client.convertAzureRedisRecommendation(context.Background(), rec)
require.NotNil(t, out, "conversion must NOT fail on catalogue-fetch error")
details, ok := out.Details.(common.CacheDetails)
require.True(t, ok)
assert.Equal(t, "redis", details.Engine)
assert.Equal(t, "Premium_P3", details.NodeType)
assert.Equal(t, 0, details.Shards, "Shards left at 0 when catalogue fetch fails")
}

// TestCacheClient_CachedSKULookup_FetchedOnce pins the perf invariant:
// many converter calls in the same GetRecommendations run trigger
// exactly ONE catalogue fetch. The mock pager counts pages served; the
// assertion verifies the count stays at 1 after multiple lookups.
func TestCacheClient_CachedSKULookup_FetchedOnce(t *testing.T) {
client := NewClient(nil, "test-subscription", "eastus")

premiumName := armredis.SKUNamePremium
familyP := armredis.SKUFamilyP
capacity := int32(1)
shardCount := int32(2)

mockPager := &MockRedisCachesPager{
pages: []armredis.ClientListBySubscriptionResponse{
{
ListResult: armredis.ListResult{
Value: []*armredis.ResourceInfo{
{
Properties: &armredis.Properties{
SKU: &armredis.SKU{
Name: &premiumName,
Family: &familyP,
Capacity: &capacity,
},
ShardCount: &shardCount,
},
},
},
},
},
},
}
client.SetRedisCachesPager(mockPager)

for i := 0; i < 10; i++ {
_, _ = client.cachedSKULookup(context.Background(), "Premium_P1")
}
assert.Equal(t, 1, mockPager.index, "catalogue must be fetched ONCE regardless of lookup count")
}

// Test the to package is properly imported (used in tests)
16 changes: 10 additions & 6 deletions providers/azure/services/compute/client.go
@@ -672,11 +672,15 @@ func extractVMPricing(items []AzureRetailPriceItem, termYears int) (onDemand, re
// providers/azure/internal/recommendations so the four Azure service
// converters share the same type-assertion + nil-guard ladder.
//
// Details populated from the reservation recommendation payload only —
// Platform / Tenancy / Scope (and the vCPU / memory counts a UI would
// want) require an ARM SKU-catalogue lookup; populating them here would
// trigger an N+1 armcompute.ResourceSKUsClient.ListByLocation per
// recommendation. That batched-enrichment is tracked as a follow-up in
// Compute is NOT yet wired to the batched-SKU-catalogue lookup landed
// for cache/cosmosdb/database. Reason: common.ComputeDetails (in
// pkg/common/types.go) currently only exposes InstanceType, Platform,
// Tenancy, Scope. The two enrichments the SKU catalogue would supply
// for compute are vCPU and MemoryGB — neither of which has a field on
// ComputeDetails to write into. Adding new fields to a shared struct
// touches every cloud provider + every consumer (frontend, common
// matchers, three other providers' converters), which is well outside
// the scope of this perf change. Tracked as a follow-up — see
// known_issues/10_azure_provider.md.
func (c *ComputeClient) convertAzureVMRecommendation(_ context.Context, azureRec armconsumption.ReservationRecommendationClassification) *common.Recommendation {
f := recommendations.Extract(azureRec)
@@ -699,7 +703,7 @@ func (c *ComputeClient) convertAzureVMRecommendation(_ context.Context, azureRec
Timestamp: time.Now(),
// Only InstanceType is safely derivable from the payload. Platform,
// Tenancy, Scope are armcompute.ResourceSKUsClient territory and
// are deferred to the batched-enrichment follow-up.
// are deferred — see the function godoc for the scope decision.
Details: common.ComputeDetails{
InstanceType: f.ResourceType,
},