perf(azure): parallelize 5 sequential service calls in GetRecommendations (closes #258) #259
Conversation
The dispatcher in providers/azure/recommendations.go::GetRecommendations was running 5 service calls (compute / database / cache / cosmosdb / advisor) one after another. Each per-service call internally hits two slow Azure endpoints (ARM Consumption Reservation Recommendations ~5-15s, prices.azure.com retail pricing pagination ~5-10s), totalling 50-125s end-to-end. This consistently exceeded the API Lambda's 60s timeout, producing 502s on user-triggered refresh.

The five calls have no inter-service dependency, so they can run concurrently under errgroup.WithContext. Each goroutine captures its own error in a closure-scoped variable and returns nil to the group so that one service's failure does not cancel its siblings (matching the previous sequential behaviour where each error was logged via logging.Warnf and the loop continued). Results are appended in the original deterministic order (compute → database → cache → cosmosdb → advisor) after Wait() so any order-sensitive callers stay stable.

Expected end-to-end after this change: ~max(individual call durations) ≈ 15-20s, well within the 60s timeout.
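For readers unfamiliar with the pattern, here is a minimal self-contained sketch of the fan-out described above. The fetch functions and the Recommendation type are hypothetical stand-ins, not the real per-service clients; only the errgroup.WithContext fan-out and the error-isolation convention are taken from the PR.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

type Recommendation struct{ Service string }

// Hypothetical per-service fetchers standing in for the real clients.
func fetchCompute(ctx context.Context) ([]Recommendation, error) {
	return []Recommendation{{Service: "compute"}}, nil
}

func fetchDatabase(ctx context.Context) ([]Recommendation, error) {
	return nil, fmt.Errorf("database: throttled")
}

func getAll(ctx context.Context) []Recommendation {
	g, gctx := errgroup.WithContext(ctx)

	// Each goroutine writes into its own closure-scoped variables and
	// returns nil, so one service's failure never cancels its siblings.
	var (
		computeRecs, dbRecs []Recommendation
		computeErr, dbErr   error
	)
	g.Go(func() error {
		computeRecs, computeErr = fetchCompute(gctx)
		return nil
	})
	g.Go(func() error {
		dbRecs, dbErr = fetchDatabase(gctx)
		return nil
	})
	_ = g.Wait() // always nil by construction; errors live in the variables above

	// Merge in a fixed order after Wait so callers see deterministic output.
	var all []Recommendation
	if computeErr != nil {
		fmt.Printf("compute: %v\n", computeErr)
	} else {
		all = append(all, computeRecs...)
	}
	if dbErr != nil {
		fmt.Printf("database: %v\n", dbErr)
	} else {
		all = append(all, dbRecs...)
	}
	return all
}

func main() {
	fmt.Println(getAll(context.Background()))
}
```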
📝 Walkthrough
The Azure provider's Go toolchain was updated to version 1.25.0 with the addition of the golang.org/x/sync dependency (used for errgroup).

Changes: Parallel Recommendations Fetching
Sequence Diagram

```mermaid
sequenceDiagram
participant Main as Main Flow
participant EG as errgroup.WithContext
participant Compute as Compute Service
participant DB as DB Service
participant Cache as Cache Service
participant Cosmos as CosmosDB Service
participant Advisor as Advisor Service
participant Result as Result Collector
Main->>EG: Create errgroup with context
EG->>Compute: Spawn goroutine
EG->>DB: Spawn goroutine
EG->>Cache: Spawn goroutine
EG->>Cosmos: Spawn goroutine
EG->>Advisor: Spawn goroutine
par Parallel Execution
Compute->>Compute: Fetch recommendations
Compute->>Result: Store result/error
and
DB->>DB: Fetch recommendations
DB->>Result: Store result/error
and
Cache->>Cache: Fetch recommendations
Cache->>Result: Store result/error
and
Cosmos->>Cosmos: Fetch recommendations
Cosmos->>Result: Store result/error
and
Advisor->>Advisor: Fetch recommendations
Advisor->>Result: Store result/error
end
EG->>Main: Wait for all goroutines
Main->>Main: Log per-service warnings/errors
Main->>Main: Append successful results in order
Main->>Main: Return aggregated recommendations
```
@coderabbitai full review

✅ Actions performed: Full review triggered.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@providers/azure/recommendations.go`:
- Around line 61-64: Add a concurrency regression test/benchmark that uses a
fake client for each service (compute, database, cache, cosmosdb, advisor) where
each fake sleeps a configurable duration; call the recommendations orchestration
(the code path that uses errgroup and launches goroutines) and assert the total
elapsed time is approximately the max single-client sleep (not the sum) to
ensure calls run concurrently, and also assert the returned results preserve the
deterministic order (compute → database → cache → cosmosdb → advisor); make this
a fast unit test with small sleeps and a benchmark variant to detect
regressions.
- Around line 71-119: The goroutines swallow context cancellation because every
g.Go returns nil and the code discards g.Wait(); change the final wait handling
so that after g.Wait() you check ctx.Err() (or gctx.Err()) and propagate it if
non-nil. Concretely, replace "_ = g.Wait()" with err := g.Wait(); if err != nil
{ return err } and/or if ctx.Err() != nil { return ctx.Err() } (preserving the
per-service error variables and logging) so that context.Canceled or
context.DeadlineExceeded from gctx/ctx is returned to callers; keep each
anonymous goroutine bodies (where computeClient.GetRecommendations,
databaseClient.GetRecommendations, cacheClient.GetRecommendations,
cosmosdb.NewClient, r.getAdvisorRecommendations, etc.) returning nil to retain
per-service error isolation.
⛔ Files ignored due to path filters (1): providers/azure/go.sum is excluded by !**/*.sum
📒 Files selected for processing (2): providers/azure/go.mod, providers/azure/recommendations.go
```go
// All five service calls run concurrently under errgroup. Each goroutine captures
// its own error and returns nil to the group so that a single service failure
// does not cancel sibling calls. Results are appended in a deterministic order
// (compute → database → cache → cosmosdb → advisor) after all goroutines finish.
```
🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift
Please pin the latency win with a concurrency regression test.
This refactor’s main value is wall-clock behavior, but nothing in the diff guards against accidental re-serialization later. A fake-client test/benchmark that sleeps per service and asserts total duration stays near the max single-call latency, while result order remains deterministic, would make this change much safer to maintain.
Also applies to: 65-159
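A hedged sketch of what such a test could look like. It exercises the errgroup fan-out pattern directly rather than the real GetRecommendations (whose constructors, as noted later in the thread, expose no injection seam); the durations and the threshold are illustrative only.

```go
package azure_test

import (
	"context"
	"testing"
	"time"

	"golang.org/x/sync/errgroup"
)

// TestFanOutRunsConcurrently spawns five fake "services" that each sleep,
// then asserts total elapsed time is near the max sleep, not the sum.
func TestFanOutRunsConcurrently(t *testing.T) {
	sleeps := []time.Duration{
		30 * time.Millisecond, // compute
		50 * time.Millisecond, // database
		20 * time.Millisecond, // cache
		40 * time.Millisecond, // cosmosdb
		10 * time.Millisecond, // advisor
	}

	start := time.Now()
	g, gctx := errgroup.WithContext(context.Background())
	for _, d := range sleeps {
		d := d
		g.Go(func() error {
			select {
			case <-time.After(d):
			case <-gctx.Done():
			}
			return nil // error isolation, as in the production code
		})
	}
	_ = g.Wait()
	elapsed := time.Since(start)

	// Sequential execution would take ~150ms; concurrent ~50ms.
	if elapsed > 120*time.Millisecond {
		t.Fatalf("calls appear serialized: elapsed %v", elapsed)
	}
}
```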
Diff excerpt from providers/azure/recommendations.go:

```diff
+	g, gctx := errgroup.WithContext(ctx)
+
 	// Compute (VM) recommendations — subscription-wide.
 	if shouldIncludeService(params, common.ServiceCompute) {
-		computeClient := compute.NewClient(r.cred, r.subscriptionID, "")
-		computeRecs, err := computeClient.GetRecommendations(ctx, params)
-		if err != nil {
-			logging.Warnf("Azure compute recommendations: %v", err)
-		} else {
-			allRecommendations = append(allRecommendations, computeRecs...)
-		}
+		g.Go(func() error {
+			computeClient := compute.NewClient(r.cred, r.subscriptionID, "")
+			computeRecs, computeErr = computeClient.GetRecommendations(gctx, params)
+			return nil // error isolation: never propagate to errgroup
+		})
 	}

 	// Database (SQL) recommendations — subscription-wide.
 	if shouldIncludeService(params, common.ServiceRelationalDB) {
-		dbClient := database.NewClient(r.cred, r.subscriptionID, "")
-		dbRecs, err := dbClient.GetRecommendations(ctx, params)
-		if err != nil {
-			logging.Warnf("Azure database recommendations: %v", err)
-		} else {
-			allRecommendations = append(allRecommendations, dbRecs...)
-		}
+		g.Go(func() error {
+			dbClient := database.NewClient(r.cred, r.subscriptionID, "")
+			dbRecs, dbErr = dbClient.GetRecommendations(gctx, params)
+			return nil
+		})
 	}

 	// Cache (Redis) recommendations — subscription-wide.
 	if shouldIncludeService(params, common.ServiceCache) {
-		cacheClient := cache.NewClient(r.cred, r.subscriptionID, "")
-		cacheRecs, err := cacheClient.GetRecommendations(ctx, params)
-		if err != nil {
-			logging.Warnf("Azure cache recommendations: %v", err)
-		} else {
-			allRecommendations = append(allRecommendations, cacheRecs...)
-		}
+		g.Go(func() error {
+			cacheClient := cache.NewClient(r.cred, r.subscriptionID, "")
+			cacheRecs, cacheErr = cacheClient.GetRecommendations(gctx, params)
+			return nil
+		})
 	}

 	// CosmosDB (NoSQL) recommendations — subscription-wide.
 	if shouldIncludeService(params, common.ServiceNoSQL) {
+		g.Go(func() error {
+			cosmosClient := cosmosdb.NewClient(r.cred, r.subscriptionID, "")
+			cosmosRecs, cosmosErr = cosmosClient.GetRecommendations(gctx, params)
+			return nil
+		})
 	}

 	// Azure Advisor adds cross-cutting cost recommendations independent of the
 	// per-service Reservation API. Failures here are non-fatal — the per-service
 	// results above are still useful on their own.
-	advisorRecs, err := r.getAdvisorRecommendations(ctx, params)
-	if err != nil {
-		logging.Errorf("Failed to get Azure Advisor recommendations: %v", err)
+	g.Go(func() error {
+		advisorRecs, advisorErr = r.getAdvisorRecommendations(gctx, params)
+		return nil
+	})

+	// Wait for all goroutines. g.Wait() always returns nil because every goroutine
+	// returns nil — errors are captured in per-service variables above.
+	_ = g.Wait()
```
🧩 Analysis chain
🌐 Web query:
golang.org/x/sync/errgroup WithContext documentation: when is the derived context canceled, and what does Wait return if every goroutine returns nil but the parent context is canceled?
💡 Result:
When using golang.org/x/sync/errgroup WithContext:

1) When is the derived context canceled? The derived context returned by WithContext is canceled the first time either (a) any function passed to Go returns a non-nil error, or (b) Wait returns, whichever occurs first.

2) What does Wait return if every goroutine returns nil but the parent context is canceled? Wait only returns the first non-nil error produced by the goroutines (g.err). If all goroutines return nil, g.err stays nil. Wait does still cancel the derived context when it returns (it calls g.cancel(g.err) with g.err possibly nil), but Wait's return value remains nil. So: if every goroutine returns nil, g.Wait returns nil even if the parent context was canceled. If you need Wait to return something like ctx.Err on parent cancellation, you must ensure at least one goroutine returns ctx.Err (or otherwise propagate it), because errgroup.Wait does not automatically return ctx.Err just because the parent context was canceled.
Citations:
- 1: https://go.pact.im/doc/src/golang.org/x/sync/errgroup/errgroup.go.html
- 2: https://go.dev/src/cmd/vendor/golang.org/x/sync/errgroup/errgroup.go
- 3: https://godoc.org/golang.org/x/sync/errgroup
- 4: https://github.com/golang/sync/blob/master/errgroup/errgroup_test.go
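A small runnable illustration of the semantics quoted above; this is a generic errgroup demo, not code from the PR.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // parent canceled before any work starts

	g, gctx := errgroup.WithContext(ctx)
	g.Go(func() error {
		<-gctx.Done() // observes cancellation...
		return nil    // ...but swallows it
	})

	fmt.Println("Wait:", g.Wait())     // Wait: <nil>
	fmt.Println("ctx.Err:", ctx.Err()) // ctx.Err: context canceled
}
```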
🏁 Scripts executed (repository: LeanerCloud/CUDly):

```shell
wc -l providers/azure/recommendations.go               # output length: 100
sed -n '150,165p' providers/azure/recommendations.go   # output length: 598
sed -n '120,165p' providers/azure/recommendations.go   # output length: 1644
sed -n '71,75p' providers/azure/recommendations.go     # output length: 234
sed -n '50,75p' providers/azure/recommendations.go     # output length: 1589
sed -n '115,162p' providers/azure/recommendations.go   # output length: 1751
```
Propagate context cancellation instead of returning success on timeout.
Every goroutine returns nil, so g.Wait() cannot surface context.Canceled or context.DeadlineExceeded. If the parent context times out before all results are collected, line 158 still returns nil, allowing callers to treat an empty or partial result set as a clean success.
After g.Wait() at line 115, check ctx.Err() to propagate parent cancellation while preserving the per-service error logging behavior.
Suggested fix

```diff
 // Wait for all goroutines. g.Wait() always returns nil because every goroutine
 // returns nil — errors are captured in per-service variables above.
 _ = g.Wait()

 // Log per-service errors (matches previous sequential behaviour). Append
 // results in the same deterministic order as the original sequential code
 // (compute → database → cache → cosmosdb → advisor) so that any
 // order-sensitive callers remain stable.
 allRecommendations := make([]common.Recommendation, 0,
 	len(computeRecs)+len(dbRecs)+len(cacheRecs)+len(cosmosRecs)+len(advisorRecs))
 if computeErr != nil {
 	logging.Warnf("Azure compute recommendations: %v", computeErr)
 } else {
 	allRecommendations = append(allRecommendations, computeRecs...)
 }
 if dbErr != nil {
 	logging.Warnf("Azure database recommendations: %v", dbErr)
 } else {
 	allRecommendations = append(allRecommendations, dbRecs...)
 }
 if cacheErr != nil {
 	logging.Warnf("Azure cache recommendations: %v", cacheErr)
 } else {
 	allRecommendations = append(allRecommendations, cacheRecs...)
 }
 if cosmosErr != nil {
 	logging.Warnf("Azure cosmosdb recommendations: %v", cosmosErr)
 } else {
 	allRecommendations = append(allRecommendations, cosmosRecs...)
 }
 if advisorErr != nil {
 	logging.Errorf("Failed to get Azure Advisor recommendations: %v", advisorErr)
 } else {
 	allRecommendations = append(allRecommendations, advisorRecs...)
 }
+if err := ctx.Err(); err != nil {
+	return allRecommendations, err
+}
 return allRecommendations, nil
```
…#258) Two CodeRabbit findings on PR #259:

1. providers/azure/recommendations.go::GetRecommendations now checks ctx.Err() after g.Wait() and propagates context.Canceled / context.DeadlineExceeded to callers. Without this, every per-service goroutine returns nil to the errgroup (intentional, for error isolation across services), so the parent ctx being cancelled mid-fan-out would silently produce "all services completed cleanly" when in fact the caller's deadline expired.

2. New test TestRecommendationsClientAdapter_GetRecommendations_PropagatesContextCancellation pins the contract: calling GetRecommendations with a pre-cancelled context must return context.Canceled, not nil.

Result-merging extracted into a mergeServiceResults helper + serviceResult struct so GetRecommendations stays under the project's cyclomatic-complexity gate after the post-Wait ctx.Err() block was added. Behaviour is identical to the previous inline merge (per-service errors logged in the same order with the same severity, then successful results appended in canonical compute -> database -> cache -> cosmosdb -> advisor order).

CR also requested a concurrency timing test using fake clients per service to assert total elapsed ≈ max single-client duration (proving the errgroup actually parallelises). Deferred to a follow-up issue because the existing service-client constructors build concrete Azure SDK types with no injection seam.
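A sketch of the contract pinned by the test in item 2 above, assuming the usual (ctx) → (results, error) shape. The stand-in function mirrors only the post-Wait ctx.Err() check; all names here are illustrative, not the actual code from commit 7c43b05.

```go
package azure_test

import (
	"context"
	"errors"
	"testing"
)

// getRecommendations is a hypothetical stand-in for the real method: the
// fan-out is elided, keeping only the post-Wait ctx.Err() propagation.
func getRecommendations(ctx context.Context) ([]string, error) {
	// ... errgroup fan-out and per-service merging elided ...
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return []string{}, nil
}

func TestGetRecommendations_PropagatesContextCancellation(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // pre-cancelled before the call

	_, err := getRecommendations(ctx)
	if !errors.Is(err, context.Canceled) {
		t.Fatalf("want context.Canceled, got %v", err)
	}
}
```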
CodeRabbit pass-1 response
Fixed (pushed in 7c43b05):
- Result-merging extracted into a mergeServiceResults helper.
Deferred to a follow-up: the concurrency timing test (the service-client constructors expose no injection seam).
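A possible shape for that helper, under the assumption that a serviceResult struct pairs each service's slice with its captured error. The field names, the severity flag, and the exact signature are guesses, not the committed code; common.Recommendation and logging are the repo packages visible in the diff above.

```go
// Hypothetical sketch of the extracted helper (not the committed code).
type serviceResult struct {
	name     string
	recs     []common.Recommendation
	err      error
	critical bool // advisor logs at Errorf severity; the rest at Warnf
}

// mergeServiceResults logs per-service errors in slice order and appends
// successful results in that same canonical order.
func mergeServiceResults(results []serviceResult) []common.Recommendation {
	merged := make([]common.Recommendation, 0)
	for _, r := range results {
		if r.err != nil {
			if r.critical {
				logging.Errorf("Failed to get Azure %s recommendations: %v", r.name, r.err)
			} else {
				logging.Warnf("Azure %s recommendations: %v", r.name, r.err)
			}
			continue // skip this service's results, keep the siblings
		}
		merged = append(merged, r.recs...)
	}
	return merged
}
```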
@coderabbitai full review
The deferral reasoning for the concurrency timing test is sound. Without an injection seam in the service constructors, any meaningful "elapsed ≈ max-single-service" assertion would need to mock at the package-function level, which is a broader refactor than belongs in this PR. Triggering a full review of the updated state now.
Summary
The Azure recommendations dispatcher (providers/azure/recommendations.go::GetRecommendations) was running 5 service calls in series — compute, database (SQL), cache (Redis), cosmosdb, and Azure Advisor. Each call internally hits two slow Azure endpoints (ARM Consumption ~5–15s + prices.azure.com pagination ~5–10s), totalling 50–125s end-to-end.

This consistently exceeded the API Lambda's 60s timeout, producing 502s on user-triggered refresh (see #257 for the broader async-refresh architecture; this PR is the orthogonal quick win).
Change
The 5 calls have no inter-service dependency, so they now run concurrently under errgroup.WithContext. Each goroutine captures its own error in a closure-scoped variable and returns nil to the group so that one service's failure does not cancel its siblings — matching the previous sequential behaviour where errors were logged via logging.Warnf and the loop continued.

Preserved invariants
- shouldIncludeService filtering: unchanged.
- Per-service errors are logged after Wait() and the corresponding service's results are skipped from the merged slice.
- gctx from errgroup.WithContext is passed to each goroutine so a parent cancellation cascades.

Expected timing
Before: ~50–125s sequential (5 × 5–25s per service).
After: ~max(individual call durations) ≈ 15–20s.
Test plan
go build ./providers/azure/... succeeds

Out of scope
Caching the prices.azure.com SKU catalogue across requests (separate, larger effort).

Closes #258.
Summary by CodeRabbit
- Chores: Azure provider Go toolchain updated to 1.25.0; golang.org/x/sync dependency added.
- Performance Improvements: the five Azure recommendation service calls now run in parallel, reducing end-to-end refresh latency.