30 changes: 25 additions & 5 deletions known_issues/10_azure_provider.md
@@ -1,6 +1,6 @@
# Known Issues: Azure Provider

> **Audit status (2026-04-21):** `2 follow-ups from CRITICAL rewrite · 12 resolved · 0 partially fixed · 0 moved · 1 new follow-up surfaced (batched SKU enrichment — deferred N+1 avoidance)`
> **Audit status (2026-04-25):** `2 follow-ups from CRITICAL rewrite · 13 resolved · 0 partially fixed · 0 moved · 1 narrower follow-up surfaced (compute Details fields require pkg/common type extension)`

## ~~CRITICAL: Recommendation converters ignore the API response entirely~~ — RESOLVED

@@ -300,13 +300,33 @@ Each converter's godoc now explicitly documents which fields are payload-derivable

### New follow-ups surfaced during this resolution

#### MEDIUM: Azure converters leave SKU-lookup Details fields empty (deferred N+1 avoidance)
#### ~~MEDIUM: Azure converters leave SKU-lookup Details fields empty (deferred N+1 avoidance)~~ — RESOLVED for 3 of 4 services

**File:** `providers/azure/services/{compute,database,cache,cosmosdb}/client.go::convertAzure*Recommendation`
**File:** `providers/azure/services/{database,cache,cosmosdb}/client.go::convertAzure*Recommendation`

**Description:** All four converters populate the subset of `Details` derivable from the reservation recommendation payload only. Fields that require an SDK catalogue lookup — compute Platform/Tenancy/Scope + memory/vCPU, database EngineVersion/AZConfig, cache Shards, cosmos APIType — are deliberately left empty because populating them inline would produce an N+1 API call per recommendation.
**Status:** ✔️ Resolved for cache/cosmos/database; compute scoped back (see follow-up below)

**Proposed fix:** single-shot batched SKU lookup per region per service (`armcompute.ResourceSKUsClient.ListByLocation(region)`, `armsql.SKUsClient.ListByServer`, `armredis.SKUsClient.ListByLocation`, `armcosmos.DatabaseAccountsClient.ListByResourceGroup`), called ONCE at the start of `GetRecommendations` for each region appearing in the response, cached in a `map[skuName]SKUCatalogueEntry`, then consulted as the converter builds each rec's Details. No N+1 storm, full Details population.
**Resolution:** Each of the three database/cache/cosmosdb service clients gained a lazy SKU catalogue cache (`sync.Once`-gated `map[string]<svcSKUEntry>`). The catalogue is fetched ONCE per client lifetime via the existing per-service pager interface — `armredis.Client.NewListBySubscriptionPager` for cache, `armsql.CapabilitiesClient.ListByLocation` for database, `armcosmos.DatabaseAccountsClient.NewListPager` for cosmos. Subsequent converter calls in the same `GetRecommendations` run hit the in-memory map (O(1) lookup), eliminating the N+1 storm. A failed catalogue fetch leaves the cache map nil; converters fall back to the previous empty-Details behaviour with a one-time WARN log — the conversion itself never fails.

Fields now populated from the catalogue:

- **Cache:** `CacheDetails.Shards` from `armredis.Properties.ShardCount` (clustered Premium-tier caches; non-clustered caches keep Shards at zero, which signals "unknown").
- **Database:** `DatabaseDetails.EngineVersion` from the SQL Server version traversed in `armsql.LocationCapabilities.SupportedServerVersions` → `SupportedEditions` → `SupportedServiceLevelObjectives` → `SKU.Name`. AZConfig/Deployment require per-server lookups and remain deferred.
- **Cosmos:** `NoSQLDetails.APIType` (sql / mongodb / cassandra / gremlin / table) resolved by mapping `account.Kind` + `account.Properties.Capabilities` for the single dominant APIType across the subscription's Cosmos accounts. Multi-API-type subscriptions leave APIType empty rather than guessing.

Tests in each service's `client_test.go` cover three new contracts per service:

- `_PopulatesShardsFromSKUCache` / `_PopulatesAPIType` / `_PopulatesEngineVersion` — cache hit populates the previously-deferred field.
- `_PagerErrorFallsBack` / `_CapabilitiesErrorFallsBack` — catalogue-fetch error leaves the field empty + the conversion still succeeds.
- `_FetchedOnce` — many converter calls share a single catalogue fetch (the N+1 invariant).

#### MEDIUM: Azure compute Details vCPU/MemoryGB require ComputeDetails type extension

**File:** `providers/azure/services/compute/client.go::convertAzureVMRecommendation`, `pkg/common/types.go::ComputeDetails`

**Description:** The cache/cosmos/database batched-SKU-catalogue lookup intentionally skipped compute. Reason: `common.ComputeDetails` currently exposes only `InstanceType`, `Platform`, `Tenancy`, `Scope`. The SKU-catalogue enrichments compute would supply (vCPU + MemoryGB from `armcompute.ResourceSKU.Capabilities`) have no struct field to write into. Adding fields to a shared `pkg/common` type touches every cloud provider's converter, the frontend rendering layer, the matchers, and existing snapshot/golden tests across all 3 cloud providers — well outside the scope of a perf change to the Azure converters.

**Proposed fix:** (1) add `VCPU int` + `MemoryGB float64` fields to `common.ComputeDetails`; (2) update AWS + GCP converters to populate them where the SDK already exposes the data; (3) port the same lazy `cachedSKULookup` pattern from cache/cosmos/database into compute and wire the new fields through.

**Status:** ❓ Needs triage

110 changes: 102 additions & 8 deletions providers/azure/services/cache/client.go
@@ -10,6 +10,7 @@ import (
"net/http"
"net/url"
"strings"
"sync"
"time"

"github.com/Azure/azure-sdk-for-go/sdk/azcore"
@@ -19,11 +20,22 @@ import (
"github.com/google/uuid"

"github.com/LeanerCloud/CUDly/pkg/common"
"github.com/LeanerCloud/CUDly/pkg/logging"
"github.com/LeanerCloud/CUDly/providers/azure/internal/httpclient"
"github.com/LeanerCloud/CUDly/providers/azure/internal/pricing"
"github.com/LeanerCloud/CUDly/providers/azure/internal/recommendations"
)

// redisSKUEntry holds the SKU-catalogue-derived fields the converter
// wants for each Redis SKU. Sourced from the cache's Properties:
// - shardCount: Properties.ShardCount (Premium-tier clustered caches).
//
// Non-clustered caches return shardCount=0; the converter treats 0 as
// "unknown" and leaves CacheDetails.Shards at its zero value.
type redisSKUEntry struct {
shardCount int
}

// HTTPClient interface for HTTP operations (enables mocking)
type HTTPClient interface {
Do(req *http.Request) (*http.Response, error)
@@ -56,6 +68,16 @@ type CacheClient struct {
recommendationsPager RecommendationsPager
reservationsPager ReservationsDetailsPager
redisCachesPager RedisCachesPager

// Lazy SKU catalogue cache. armredis exposes no per-region "list all
// possible SKUs" surface, so we derive shard counts from the existing
// caches in the subscription (NewListBySubscriptionPager). Fetched
// ONCE per client lifetime; subsequent converter calls in the same
// GetRecommendations run hit the in-memory map. A failed fetch leaves
// skuCacheMap nil and converters fall back to Shards=0 with a WARN
// log — the conversion itself does NOT fail.
skuCacheOnce sync.Once
skuCacheMap map[string]redisSKUEntry
}

// NewClient creates a new Azure Cache client
@@ -557,14 +579,23 @@ func extractRedisPricing(items []CacheRetailPriceItem, termYears int) (onDemand,
// SDK-to-struct ladder. Returns nil when the SDK payload is unusable.
//
// Details populated by parsing the SKU string into Engine ("redis") and
// NodeType. Shard count requires an armredis SKU catalogue lookup and is
// deferred to a batched-enrichment follow-up tracked in
// known_issues/10_azure_provider.md.
func (c *CacheClient) convertAzureRedisRecommendation(_ context.Context, azureRec armconsumption.ReservationRecommendationClassification) *common.Recommendation {
// NodeType, then enriched from the lazily-cached armredis catalogue
// (cachedSKULookup). Shards is sourced from the catalogue when an
// existing cache in the subscription matches the recommendation's
// Premium-tier SKU; otherwise stays 0 (zero means "unknown", not
// "definitely zero shards" — see the redisSKUEntry godoc).
func (c *CacheClient) convertAzureRedisRecommendation(ctx context.Context, azureRec armconsumption.ReservationRecommendationClassification) *common.Recommendation {
f := recommendations.Extract(azureRec)
if f == nil {
return nil
}
details := common.CacheDetails{
Engine: "redis",
NodeType: f.ResourceType,
}
if entry, ok := c.cachedSKULookup(ctx, f.ResourceType); ok && entry.shardCount > 0 {
details.Shards = entry.shardCount
}
return &common.Recommendation{
Provider: common.ProviderAzure,
Service: common.ServiceCache,
@@ -579,11 +610,74 @@ func (c *CacheClient) convertAzureRedisRecommendation(_ context.Context, azureRe
Term: f.Term,
PaymentOption: "upfront",
Timestamp: time.Now(),
Details: common.CacheDetails{
Engine: "redis",
NodeType: f.ResourceType,
},
Details: details,
}
}

// cachedSKULookup returns the SKU catalogue entry for skuName, fetching
// the catalogue lazily on first call. The catalogue is fetched ONCE per
// client lifetime via armredis.Client.NewListBySubscriptionPager;
// subsequent calls are O(1) map lookups. ok=false on cache miss OR
// catalogue-fetch failure — the caller falls back to Shards=0 rather
// than failing the whole conversion.
//
// Why source from existing caches: armredis exposes no "list all
// possible SKUs" endpoint. The catalogue surface that does exist
// (existing cache instances in the subscription) gives us authoritative
// shard counts for the SKUs the customer actually uses, which is the
// set the recommendation engine recommends from anyway.
func (c *CacheClient) cachedSKULookup(ctx context.Context, skuName string) (redisSKUEntry, bool) {
c.skuCacheOnce.Do(func() {
c.skuCacheMap = c.fetchSKUCatalogue(ctx)
})
if c.skuCacheMap == nil {
return redisSKUEntry{}, false
}
entry, ok := c.skuCacheMap[skuName]
return entry, ok
}

// fetchSKUCatalogue performs the single ListBySubscription walk and
// reduces the response into a name->redisSKUEntry map keyed by the
// "<Tier>_<Family><Capacity>" SKU naming convention (e.g. "Premium_P1")
// that matches the recommendation engine's ResourceType output.
//
// Returns nil on error so the sync.Once-gated cache field stays nil and
// cachedSKULookup falls back to the empty-Details path. The fetch error
// is logged WARN once.
func (c *CacheClient) fetchSKUCatalogue(ctx context.Context) map[string]redisSKUEntry {
pager, err := c.createRedisCachesPager()
if err != nil {
logging.Warnf("azure cache: SKU catalogue pager create failed for region %s: %v — Details.Shards left at 0", c.region, err)
return nil
}
out := make(map[string]redisSKUEntry)
for pager.More() {
page, err := pager.NextPage(ctx)
if err != nil {
logging.Warnf("azure cache: SKU catalogue page fetch failed for region %s: %v — partial cache (%d entries) discarded, Details.Shards left at 0", c.region, err, len(out))
return nil
}
for _, cache := range page.Value {
fullSKU := extractSKUFromCache(cache)
if fullSKU == "" {
continue
}
shards := 0
if cache.Properties != nil && cache.Properties.ShardCount != nil {
shards = int(*cache.Properties.ShardCount)
}
// First-write-wins: if the same SKU appears on two caches
// with different ShardCounts, keep the first. Real Premium
// clusters configured to the same SKU typically share the
// shard count anyway — this just keeps the cache
// deterministic regardless of pager order.
if _, exists := out[fullSKU]; !exists {
out[fullSKU] = redisSKUEntry{shardCount: shards}
}
}
}
return out
}

// applyPurchaseAutomationTag attaches the purchase-automation tag to an Azure
130 changes: 126 additions & 4 deletions providers/azure/services/cache/client_test.go
@@ -740,9 +740,14 @@ func TestCacheClient_ConvertAzureRedisRecommendation_NilGuards(t *testing.T) {

// TestCacheClient_ConvertAzureRedisRecommendation_PopulatesAllFields asserts
// the converter forwards every helper-extracted field + applies the
// Cache-service-specific constants.
// Cache-service-specific constants. An empty SKU catalogue (no caches in
// the subscription) is the "no signal" baseline — Shards stays 0.
func TestCacheClient_ConvertAzureRedisRecommendation_PopulatesAllFields(t *testing.T) {
client := NewClient(nil, "test-subscription", "eastus")
// Inject empty mock pager so the cachedSKULookup hits the lazy-fetch
// codepath without making a real Azure API call.
client.SetRedisCachesPager(&MockRedisCachesPager{})

rec := mocks.BuildLegacyReservationRecommendation(
mocks.WithRegion("eastus"),
mocks.WithTerm("P3Y"),
@@ -765,14 +770,131 @@ func TestCacheClient_ConvertAzureRedisRecommendation_PopulatesAllFields(t *testi
assert.Equal(t, "3yr", out.Term)
assert.Equal(t, "upfront", out.PaymentOption)

// Details carries Engine=redis + NodeType from the SKU string.
// Shard count is deferred to batched enrichment.
// Details carries Engine=redis + NodeType from the SKU string. Shards
// stays 0 when no cache instance matches in the subscription (the
// catalogue's only signal source for shard counts).
require.NotNil(t, out.Details)
details, ok := out.Details.(common.CacheDetails)
require.True(t, ok, "Details must be a common.CacheDetails value")
assert.Equal(t, "redis", details.Engine)
assert.Equal(t, "Premium_P3", details.NodeType)
assert.Equal(t, 0, details.Shards, "Shards is deferred to batched enrichment")
assert.Equal(t, 0, details.Shards, "Shards is 0 when no matching cache exists in the subscription")
}

// TestCacheClient_ConvertAzureRedisRecommendation_PopulatesShardsFromSKUCache
// asserts the new batched-SKU-catalogue lookup populates CacheDetails.Shards
// when the subscription has a Premium-tier clustered cache with the same
// SKU as the recommendation. Single ListBySubscription call per client
// lifetime feeds many converter calls; this test pins the contract.
func TestCacheClient_ConvertAzureRedisRecommendation_PopulatesShardsFromSKUCache(t *testing.T) {
client := NewClient(nil, "test-subscription", "eastus")

premiumName := armredis.SKUNamePremium
familyP := armredis.SKUFamilyP
capacity := int32(3)
shardCount := int32(5)

// One Premium_P3 cache with 5 shards in the subscription. The
// converter will see f.ResourceType="Premium_P3" and look it up.
mockPager := &MockRedisCachesPager{
pages: []armredis.ClientListBySubscriptionResponse{
{
ListResult: armredis.ListResult{
Value: []*armredis.ResourceInfo{
{
Properties: &armredis.Properties{
SKU: &armredis.SKU{
Name: &premiumName,
Family: &familyP,
Capacity: &capacity,
},
ShardCount: &shardCount,
},
},
},
},
},
},
}
client.SetRedisCachesPager(mockPager)

rec := mocks.BuildLegacyReservationRecommendation(
mocks.WithRegion("eastus"),
mocks.WithNormalizedSize("Premium_P3"),
)
out := client.convertAzureRedisRecommendation(context.Background(), rec)
require.NotNil(t, out)
details, ok := out.Details.(common.CacheDetails)
require.True(t, ok)
assert.Equal(t, "redis", details.Engine)
assert.Equal(t, "Premium_P3", details.NodeType)
assert.Equal(t, 5, details.Shards, "Shards must be enriched from the cached SKU catalogue")
}

// TestCacheClient_ConvertAzureRedisRecommendation_PagerErrorFallsBack
// asserts that a SKU-catalogue fetch failure does NOT fail the
// conversion — Shards just stays at 0 and the rest of Details is
// populated from the recommendation payload as before. This is the
// graceful-degradation contract the issue asks for.
func TestCacheClient_ConvertAzureRedisRecommendation_PagerErrorFallsBack(t *testing.T) {
client := NewClient(nil, "test-subscription", "eastus")
mockPager := &MockRedisCachesPager{
pages: []armredis.ClientListBySubscriptionResponse{{}},
err: errors.New("transient Azure API error"),
}
client.SetRedisCachesPager(mockPager)

rec := mocks.BuildLegacyReservationRecommendation(
mocks.WithRegion("eastus"),
mocks.WithNormalizedSize("Premium_P3"),
)
out := client.convertAzureRedisRecommendation(context.Background(), rec)
require.NotNil(t, out, "conversion must NOT fail on catalogue-fetch error")
details, ok := out.Details.(common.CacheDetails)
require.True(t, ok)
assert.Equal(t, "redis", details.Engine)
assert.Equal(t, "Premium_P3", details.NodeType)
assert.Equal(t, 0, details.Shards, "Shards left at 0 when catalogue fetch fails")
}

// TestCacheClient_CachedSKULookup_FetchedOnce pins the perf invariant:
// many converter calls in the same GetRecommendations run trigger
// exactly ONE catalogue fetch. The mock pager counts pages served; the
// assertion verifies the count stays at 1 after multiple lookups.
func TestCacheClient_CachedSKULookup_FetchedOnce(t *testing.T) {
client := NewClient(nil, "test-subscription", "eastus")

premiumName := armredis.SKUNamePremium
familyP := armredis.SKUFamilyP
capacity := int32(1)
shardCount := int32(2)

mockPager := &MockRedisCachesPager{
pages: []armredis.ClientListBySubscriptionResponse{
{
ListResult: armredis.ListResult{
Value: []*armredis.ResourceInfo{
{
Properties: &armredis.Properties{
SKU: &armredis.SKU{
Name: &premiumName,
Family: &familyP,
Capacity: &capacity,
},
ShardCount: &shardCount,
},
},
},
},
},
},
}
client.SetRedisCachesPager(mockPager)

for i := 0; i < 10; i++ {
_, _ = client.cachedSKULookup(context.Background(), "Premium_P1")
}
assert.Equal(t, 1, mockPager.index, "catalogue must be fetched ONCE regardless of lookup count")
}

// Test the to package is properly imported (used in tests)
16 changes: 10 additions & 6 deletions providers/azure/services/compute/client.go
@@ -672,11 +672,15 @@ func extractVMPricing(items []AzureRetailPriceItem, termYears int) (onDemand, re
// providers/azure/internal/recommendations so the four Azure service
// converters share the same type-assertion + nil-guard ladder.
//
// Details populated from the reservation recommendation payload only —
// Platform / Tenancy / Scope (and the vCPU / memory counts a UI would
// want) require an ARM SKU-catalogue lookup; populating them here would
// trigger an N+1 armcompute.ResourceSKUsClient.ListByLocation per
// recommendation. That batched-enrichment is tracked as a follow-up in
// Compute is NOT yet wired to the batched-SKU-catalogue lookup landed
// for cache/cosmosdb/database. Reason: common.ComputeDetails (in
// pkg/common/types.go) currently only exposes InstanceType, Platform,
// Tenancy, Scope. The two enrichments the SKU catalogue would supply
// for compute are vCPU and MemoryGB — neither of which has a field on
// ComputeDetails to write into. Adding new fields to a shared struct
// touches every cloud provider + every consumer (frontend, common
// matchers, three other providers' converters), which is well outside
// the scope of this perf change. Tracked as a follow-up — see
// known_issues/10_azure_provider.md.
func (c *ComputeClient) convertAzureVMRecommendation(_ context.Context, azureRec armconsumption.ReservationRecommendationClassification) *common.Recommendation {
f := recommendations.Extract(azureRec)
@@ -699,7 +703,7 @@ func (c *ComputeClient) convertAzureVMRecommendation(_ context.Context, azureRec
Timestamp: time.Now(),
// Only InstanceType is safely derivable from the payload. Platform,
// Tenancy, Scope are armcompute.ResourceSKUsClient territory and
// are deferred to the batched-enrichment follow-up.
// are deferred — see the function godoc for the scope decision.
Details: common.ComputeDetails{
InstanceType: f.ResourceType,
},