
perf(azure): parallelize 5 sequential service calls in GetRecommendations (closes #258)#259

Merged
cristim merged 2 commits into feat/multicloud-web-frontend from perf/issue-258-azure-parallel on May 4, 2026

Conversation

@cristim
Member

@cristim cristim commented May 4, 2026

Summary

The Azure recommendations dispatcher (providers/azure/recommendations.go::GetRecommendations) was running 5 service calls in series — compute, database (SQL), cache (Redis), cosmosdb, and Azure Advisor. Each call internally hits two slow Azure endpoints (ARM Consumption ~5–15s + prices.azure.com pagination ~5–10s), totalling 50–125s end-to-end.

This consistently exceeded the API Lambda's 60s timeout, producing 502s on user-triggered refresh (see #257 for the broader async-refresh architecture; this PR is the orthogonal quick-win).

Change

The 5 calls have no inter-service dependency, so they now run concurrently under errgroup.WithContext. Each goroutine captures its own error in a closure-scoped variable and returns nil to the group so that one service's failure does not cancel its siblings — matching the previous sequential behaviour where errors were logged via logging.Warnf and the loop continued.
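The closure-scoped error-capture shape described above can be sketched with a self-contained, standard-library-only stand-in. The PR itself uses errgroup.WithContext (which additionally derives a cancellable gctx); this sketch substitutes sync.WaitGroup, and fetchRecs plus the two-service fan-out are illustrative names, not the real provider API:

```go
package main

import (
	"fmt"
	"sync"
)

// fetchRecs stands in for one per-service GetRecommendations call.
func fetchRecs(service string, fail bool) ([]string, error) {
	if fail {
		return nil, fmt.Errorf("%s: throttled", service)
	}
	return []string{service + "-rec"}, nil
}

func gather() []string {
	var wg sync.WaitGroup
	var computeRecs, dbRecs []string
	var computeErr, dbErr error

	wg.Add(2)
	go func() { defer wg.Done(); computeRecs, computeErr = fetchRecs("compute", false) }()
	go func() { defer wg.Done(); dbRecs, dbErr = fetchRecs("database", true) }()
	// Errors live in per-service variables; one failure never cancels a sibling.
	wg.Wait()

	// Merge in deterministic order: compute first, then database.
	var all []string
	if computeErr == nil {
		all = append(all, computeRecs...)
	}
	if dbErr == nil {
		all = append(all, dbRecs...)
	}
	return all
}

func main() {
	fmt.Println(gather()) // database failed, so only compute's results survive
}
```

With errgroup, each `g.Go` body would assign to its own variables exactly as the goroutines above do, then `return nil` so the group never cancels siblings.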

Preserved invariants

  • Deterministic result order: compute → database → cache → cosmosdb → advisor, same as before, so any order-sensitive callers stay stable.
  • Per-service shouldIncludeService filtering: unchanged.
  • Per-service error isolation: unchanged. Errors are logged after Wait() and the corresponding service's results are skipped from the merged slice.
  • gctx from errgroup.WithContext is passed to each goroutine so a parent cancellation cascades.

Expected timing

Before: ~50–125s sequential (5 × 5–25s per service).
After: ~max(individual call durations) ≈ 15–20s.

Test plan

Out of scope

Closes #258.

Summary by CodeRabbit

  • Chores

    • Updated Go toolchain from version 1.23.0 to 1.25.0
    • Added golang.org/x/sync v0.20.0 as a new dependency
  • Performance Improvements

    • Recommendation retrieval process has been enhanced to fetch data concurrently from multiple Azure services including Compute, Relational Database, Cache, CosmosDB, and Advisor. This optimization reduces overall response times and delivers noticeably faster recommendations to users.

…ions (closes #258)

The dispatcher in providers/azure/recommendations.go::GetRecommendations
was running 5 service calls (compute / database / cache / cosmosdb /
advisor) one after another. Each per-service call internally hits two
slow Azure endpoints (ARM Consumption Reservation Recommendations
~5-15s, prices.azure.com retail pricing pagination ~5-10s), totalling
50-125s end-to-end. This consistently exceeded the API Lambda's 60s
timeout, producing 502s on user-triggered refresh.

The five calls have no inter-service dependency, so they can run
concurrently under errgroup.WithContext. Each goroutine captures its
own error in a closure-scoped variable and returns nil to the group so
that one service's failure does not cancel its siblings (matching the
previous sequential behaviour where each error was logged via
logging.Warnf and the loop continued).

Results are appended in the original deterministic order
(compute → database → cache → cosmosdb → advisor) after Wait() so any
order-sensitive callers stay stable.

Expected end-to-end after this change: ~max(individual call durations)
≈ 15-20s, well within the 60s timeout.
@cristim cristim added priority/p2 Backlog-worthy severity/medium Moderate harm urgency/this-sprint Within the current sprint impact/many Affects most users effort/xs Trivial / one-liner type/bug Defect triaged Item has been triaged labels May 4, 2026
@coderabbitai

coderabbitai Bot commented May 4, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7ead68a-9599-4d06-bc67-49708ab6dc44

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

The Azure provider's Go toolchain was updated to version 1.25.0 with the addition of the golang.org/x/sync dependency. GetRecommendations was refactored so that the five previously sequential Azure service API calls (Compute, DB, Cache, CosmosDB, Advisor) now execute concurrently under errgroup.WithContext, reducing expected end-to-end latency from 50–125s to approximately 15–20s.

Changes

Parallel Recommendations Fetching

  • Toolchain & Dependencies — providers/azure/go.mod: Go version bumped from 1.23.0 to 1.25.0; golang.org/x/sync v0.20.0 added as a direct dependency.
  • Concurrent Service Fetching — providers/azure/recommendations.go: Imports golang.org/x/sync/errgroup. GetRecommendations refactored to spawn five concurrent goroutines for the compute, database, cache, cosmosdb, and advisor clients under a shared errgroup.WithContext, each storing results independently without blocking siblings on error. All goroutines complete before per-service warnings are logged and successful results are appended in deterministic order; slice capacity is preallocated based on collected result lengths.

Sequence Diagram

sequenceDiagram
    participant Main as Main Flow
    participant EG as errgroup.WithContext
    participant Compute as Compute Service
    participant DB as DB Service
    participant Cache as Cache Service
    participant Cosmos as CosmosDB Service
    participant Advisor as Advisor Service
    participant Result as Result Collector

    Main->>EG: Create errgroup with context
    
    EG->>Compute: Spawn goroutine
    EG->>DB: Spawn goroutine
    EG->>Cache: Spawn goroutine
    EG->>Cosmos: Spawn goroutine
    EG->>Advisor: Spawn goroutine
    
    par Parallel Execution
        Compute->>Compute: Fetch recommendations
        Compute->>Result: Store result/error
    and
        DB->>DB: Fetch recommendations
        DB->>Result: Store result/error
    and
        Cache->>Cache: Fetch recommendations
        Cache->>Result: Store result/error
    and
        Cosmos->>Cosmos: Fetch recommendations
        Cosmos->>Result: Store result/error
    and
        Advisor->>Advisor: Fetch recommendations
        Advisor->>Result: Store result/error
    end
    
    EG->>Main: Wait for all goroutines
    Main->>Main: Log per-service warnings/errors
    Main->>Main: Append successful results in order
    Main->>Main: Return aggregated recommendations

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

effort/m, type/feat

Poem

🐰 Five services, once in line,
Now dance together, intertwined!
Errgroup spins them side-by-side,
Fifty seconds trimmed to a stride.
Parallelism's gentle art,
Speeds the Azure recommender's heart. 🚀

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 inconclusive)

  • Linked Issues check — ❓ Inconclusive. The code changes parallelize the 5 service calls using errgroup as required [#258] and preserve deterministic result ordering and per-service error isolation, but the PR summary acknowledges that the test/benchmark is pending CI verification. Resolution: verify that the benchmark/integration test asserting GetRecommendations duration < (sum × 0.6) has been implemented and passes, satisfying the full test-plan requirement from #258.
  • Out of Scope Changes check — ❓ Inconclusive. The PR modifies go.mod to add golang.org/x/sync v0.20.0 and update the Go version, which is necessary for errgroup usage, but the go.mod changes should be verified as scope-appropriate. Resolution: confirm these are intentional dependencies of the parallelization approach and not collateral scope creep.

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed. The title accurately summarizes the main change (parallelizing 5 sequential service calls in GetRecommendations) and matches the issue being closed.
  • Docstring Coverage — ✅ Passed. No functions found in the changed files to evaluate; skipping the docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@cristim
Member Author

cristim commented May 4, 2026

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented May 4, 2026

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@providers/azure/recommendations.go`:
- Around line 61-64: Add a concurrency regression test/benchmark that uses a
fake client for each service (compute, database, cache, cosmosdb, advisor) where
each fake sleeps a configurable duration; call the recommendations orchestration
(the code path that uses errgroup and launches goroutines) and assert the total
elapsed time is approximately the max single-client sleep (not the sum) to
ensure calls run concurrently, and also assert the returned results preserve the
deterministic order (compute → database → cache → cosmosdb → advisor); make this
a fast unit test with small sleeps and a benchmark variant to detect
regressions.
- Around line 71-119: The goroutines swallow context cancellation because every
g.Go returns nil and the code discards g.Wait(); change the final wait handling
so that after g.Wait() you check ctx.Err() (or gctx.Err()) and propagate it if
non-nil. Concretely, replace "_ = g.Wait()" with err := g.Wait(); if err != nil
{ return err } and/or if ctx.Err() != nil { return ctx.Err() } (preserving the
per-service error variables and logging) so that context.Canceled or
context.DeadlineExceeded from gctx/ctx is returned to callers; keep each
anonymous goroutine bodies (where computeClient.GetRecommendations,
databaseClient.GetRecommendations, cacheClient.GetRecommendations,
cosmosdb.NewClient, r.getAdvisorRecommendations, etc.) returning nil to retain
per-service error isolation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 51d7207e-5ea2-4489-b1e5-bc949490455d

📥 Commits

Reviewing files that changed from the base of the PR and between c8c3ba9 and aff26b8.

⛔ Files ignored due to path filters (1)
  • providers/azure/go.sum is excluded by !**/*.sum
📒 Files selected for processing (2)
  • providers/azure/go.mod
  • providers/azure/recommendations.go

Comment on lines +61 to +64
// All five service calls run concurrently under errgroup. Each goroutine captures
// its own error and returns nil to the group so that a single service failure
// does not cancel sibling calls. Results are appended in a deterministic order
// (compute → database → cache → cosmosdb → advisor) after all goroutines finish.

🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Please pin the latency win with a concurrency regression test.

This refactor’s main value is wall-clock behavior, but nothing in the diff guards against accidental re-serialization later. A fake-client test/benchmark that sleeps per service and asserts total duration stays near the max single-call latency, while result order remains deterministic, would make this change much safer to maintain.

Also applies to: 65-159


Comment on lines +71 to +119
+	g, gctx := errgroup.WithContext(ctx)
+
 	// Compute (VM) recommendations — subscription-wide.
 	if shouldIncludeService(params, common.ServiceCompute) {
-		computeClient := compute.NewClient(r.cred, r.subscriptionID, "")
-		computeRecs, err := computeClient.GetRecommendations(ctx, params)
-		if err != nil {
-			logging.Warnf("Azure compute recommendations: %v", err)
-		} else {
-			allRecommendations = append(allRecommendations, computeRecs...)
-		}
+		g.Go(func() error {
+			computeClient := compute.NewClient(r.cred, r.subscriptionID, "")
+			computeRecs, computeErr = computeClient.GetRecommendations(gctx, params)
+			return nil // error isolation: never propagate to errgroup
+		})
 	}

 	// Database (SQL) recommendations — subscription-wide.
 	if shouldIncludeService(params, common.ServiceRelationalDB) {
-		dbClient := database.NewClient(r.cred, r.subscriptionID, "")
-		dbRecs, err := dbClient.GetRecommendations(ctx, params)
-		if err != nil {
-			logging.Warnf("Azure database recommendations: %v", err)
-		} else {
-			allRecommendations = append(allRecommendations, dbRecs...)
-		}
+		g.Go(func() error {
+			dbClient := database.NewClient(r.cred, r.subscriptionID, "")
+			dbRecs, dbErr = dbClient.GetRecommendations(gctx, params)
+			return nil
+		})
 	}

 	// Cache (Redis) recommendations — subscription-wide.
 	if shouldIncludeService(params, common.ServiceCache) {
-		cacheClient := cache.NewClient(r.cred, r.subscriptionID, "")
-		cacheRecs, err := cacheClient.GetRecommendations(ctx, params)
-		if err != nil {
-			logging.Warnf("Azure cache recommendations: %v", err)
-		} else {
-			allRecommendations = append(allRecommendations, cacheRecs...)
-		}
+		g.Go(func() error {
+			cacheClient := cache.NewClient(r.cred, r.subscriptionID, "")
+			cacheRecs, cacheErr = cacheClient.GetRecommendations(gctx, params)
+			return nil
+		})
 	}

 	// CosmosDB (NoSQL) recommendations — subscription-wide.
 	if shouldIncludeService(params, common.ServiceNoSQL) {
+		g.Go(func() error {
+			cosmosClient := cosmosdb.NewClient(r.cred, r.subscriptionID, "")
+			cosmosRecs, cosmosErr = cosmosClient.GetRecommendations(gctx, params)
+			return nil
+		})
 	}

 	// Azure Advisor adds cross-cutting cost recommendations independent of the
 	// per-service Reservation API. Failures here are non-fatal — the per-service
 	// results above are still useful on their own.
-	advisorRecs, err := r.getAdvisorRecommendations(ctx, params)
-	if err != nil {
-		logging.Errorf("Failed to get Azure Advisor recommendations: %v", err)
+	g.Go(func() error {
+		advisorRecs, advisorErr = r.getAdvisorRecommendations(gctx, params)
+		return nil
+	})

+	// Wait for all goroutines. g.Wait() always returns nil because every goroutine
+	// returns nil — errors are captured in per-service variables above.
+	_ = g.Wait()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

golang.org/x/sync/errgroup WithContext documentation: when is the derived context canceled, and what does Wait return if every goroutine returns nil but the parent context is canceled?

💡 Result:

When using golang.org/x/sync/errgroup WithContext: 1) When is the derived context canceled? - The derived context returned by WithContext is canceled the first time either: a) any function passed to Go returns a non-nil error, or b) Wait returns. This cancellation happens “whichever occurs first.” 2) What does Wait return if every goroutine returns nil but the parent context is canceled? - Wait only returns the first non-nil error produced by the goroutines (g.err). If all goroutines return nil, then g.err stays nil. - Wait does still cancel the derived context when it returns (it calls g.cancel(g.err) with g.err possibly nil), but Wait’s return value remains nil. So: if every goroutine returns nil, g.Wait returns nil even if the parent context was canceled (assuming no goroutine returned a non-nil error). If you need Wait to return something like ctx.Err on parent cancellation, you must ensure at least one goroutine returns ctx.Err (or otherwise propagate it), because errgroup.Wait does not automatically return ctx.Err just because the parent context was canceled.

🏁 Scripts executed (against LeanerCloud/CUDly):

wc -l providers/azure/recommendations.go
sed -n '150,165p' providers/azure/recommendations.go
sed -n '120,165p' providers/azure/recommendations.go
sed -n '71,75p' providers/azure/recommendations.go
sed -n '50,75p' providers/azure/recommendations.go
sed -n '115,162p' providers/azure/recommendations.go


Propagate context cancellation instead of returning success on timeout.

Every goroutine returns nil, so g.Wait() cannot surface context.Canceled or context.DeadlineExceeded. If the parent context times out before all results are collected, line 158 still returns nil, allowing callers to treat an empty or partial result set as a clean success.

After g.Wait() at line 115, check ctx.Err() to propagate parent cancellation while preserving the per-service error logging behavior.

Suggested fix
 	// Wait for all goroutines. g.Wait() always returns nil because every goroutine
 	// returns nil — errors are captured in per-service variables above.
 	_ = g.Wait()

 	// Log per-service errors (matches previous sequential behaviour). Append
 	// results in the same deterministic order as the original sequential code
 	// (compute → database → cache → cosmosdb → advisor) so that any
 	// order-sensitive callers remain stable.
 	allRecommendations := make([]common.Recommendation, 0,
 		len(computeRecs)+len(dbRecs)+len(cacheRecs)+len(cosmosRecs)+len(advisorRecs))

 	if computeErr != nil {
 		logging.Warnf("Azure compute recommendations: %v", computeErr)
 	} else {
 		allRecommendations = append(allRecommendations, computeRecs...)
 	}

 	if dbErr != nil {
 		logging.Warnf("Azure database recommendations: %v", dbErr)
 	} else {
 		allRecommendations = append(allRecommendations, dbRecs...)
 	}

 	if cacheErr != nil {
 		logging.Warnf("Azure cache recommendations: %v", cacheErr)
 	} else {
 		allRecommendations = append(allRecommendations, cacheRecs...)
 	}

 	if cosmosErr != nil {
 		logging.Warnf("Azure cosmosdb recommendations: %v", cosmosErr)
 	} else {
 		allRecommendations = append(allRecommendations, cosmosRecs...)
 	}

 	if advisorErr != nil {
 		logging.Errorf("Failed to get Azure Advisor recommendations: %v", advisorErr)
 	} else {
 		allRecommendations = append(allRecommendations, advisorRecs...)
 	}

+	if err := ctx.Err(); err != nil {
+		return allRecommendations, err
+	}
 	return allRecommendations, nil

…#258)

Two CodeRabbit findings on PR #259:

  1. providers/azure/recommendations.go::GetRecommendations now checks
     ctx.Err() after g.Wait() and propagates context.Canceled /
     context.DeadlineExceeded to callers. Without this, every per-
     service goroutine returns nil to the errgroup (intentional, for
     error isolation across services), so the parent ctx being
     cancelled mid-fan-out would silently produce "all services
     completed cleanly" when in fact the caller's deadline expired.

  2. New test TestRecommendationsClientAdapter_GetRecommendations_
     PropagatesContextCancellation pins the contract: calling
     GetRecommendations with a pre-cancelled context must return
     context.Canceled, not nil.

Result-merging extracted into mergeServiceResults helper +
serviceResult struct so GetRecommendations stays under the project's
cyclomatic-complexity gate after the post-Wait ctx.Err() block was
added. Behaviour is identical to the previous inline merge (per-
service errors logged in the same order with the same severity, then
successful results appended in canonical compute -> database -> cache
-> cosmosdb -> advisor order).

CR also requested a concurrency timing test using fake clients per
service to assert total elapsed approx max-single-client (proving the
errgroup actually parallelises). Deferred to a follow-up issue
because the existing service-client constructors build concrete
Azure SDK types with no injection seam.
@cristim
Member Author

cristim commented May 4, 2026

CodeRabbit pass-1 response

Fixed (pushed in 7c43b05):

  1. GetRecommendations propagates ctx.Err() after g.Wait() so a parent ctx cancellation / deadline expiry surfaces to the caller instead of being swallowed by the per-service error-isolation goroutines.
  2. New test TestRecommendationsClientAdapter_GetRecommendations_PropagatesContextCancellation pins the contract: calling with a pre-cancelled context returns context.Canceled.

Result-merging extracted into a mergeServiceResults helper to keep GetRecommendations under the cyclomatic-complexity gate after the post-Wait ctx.Err() block was added. Behaviour is byte-identical to the inline merge.

Deferred to a follow-up:

  • Concurrency timing test with fake per-service clients (priority/major): the existing service constructors (compute.NewClient, cache.NewClient, etc.) build concrete Azure SDK types with no injection seam. A meaningful timing test asserting elapsed ≈ max-single-service requires introducing package-level function variables that the production code currently doesn't have — that's a larger refactor than a CR-pass-1 fix should be. Tracking as a follow-up issue.

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented May 4, 2026

Tip

For best results, initiate chat on the files or code changes.

@cristim Thanks for the detailed pass-1 summary! The three fixes look well-targeted:

  • Context propagation — surfacing ctx.Err() after g.Wait() is the right contract; callers relying on deadline/cancellation signals will now get them correctly.
  • mergeServiceResults helper — clean separation that also keeps cyclomatic complexity in check.
  • Context-cancellation test — good to have this pinned as a regression guard.

The deferral reasoning for the concurrency timing test is sound. Without an injection seam in the service constructors, any meaningful "elapsed ≈ max-single-service" assertion would need to mock at the package-function level, which is a broader refactor than belongs in this PR.

Triggering a full review of the updated state now.



✏️ Learnings added
Learnt from: cristim
URL: https://github.com/LeanerCloud/CUDly/pull/259

Timestamp: 2026-05-04T14:28:19.885Z
Learning: In providers/azure/recommendations.go (Go, Azure provider), the service constructors (compute.NewClient, cache.NewClient, etc.) build concrete Azure SDK types with no interface injection seam. Introducing a concurrency timing test that asserts elapsed ≈ max-single-service latency would require adding package-level function variables — a refactor that is out of scope for small performance PRs.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

