
perf: aggregate parallelism cap + 2 GB Lambda for collect refresh (followup #269)#270

Merged
cristim merged 3 commits into feat/multicloud-web-frontend from perf/recommendations-aggregate-cap
May 4, 2026

Conversation

cristim (Member) commented May 4, 2026

Summary

Follow-up to #269 (defac6c2a) — when that PR was deployed to dev, the
new fan-out parallelism exposed two operational problems we couldn't
see on the previous serial code path:

  1. Lambda OOM at 28s (Max Memory Used: 512 MB, signal: killed)
    on the user-triggered refresh — peak in-flight gRPC/HTTP clients
    across the now-concurrent provider × account × service|region tree
    exceeded the 512 MB cap.
  2. 60s timeout on a follow-up retry (Task timed out after 63.05 seconds) — same root cause; without a global cap, runaway
    concurrent IO never settles inside the budget.

Two atomic commits address both:

  • perf(concurrency) — new pkg/concurrency package adds a
    shared semaphore stashed on ctx. Every leaf goroutine across AWS /
    Azure / GCP fan-outs Acquires one slot before issuing its cloud-API
    call and Releases after. Aggregate concurrent IO is now hard-bounded
    at CUDLY_MAX_PARALLELISM (default 20) regardless of nesting
    depth. Intermediates (provider, account, GCP region dispatcher)
    don't acquire — only leaves do — so no deadlock by holding-permit-
    while-waiting-for-sub-permits. Acquire/Release are no-ops when
    no semaphore is on ctx, so CLI tools and unit tests pass through
    unchanged. Mirrors the per-fan-out pattern but adds a single global
    bound; per-level caps still help with launch-side bookkeeping.

    • 11 new tests in pkg/concurrency: env-knob parser, no-semaphore
      no-ops, cap=3 bounds load test (20 goroutines, peak ≤ 3),
      ctx-cancel propagation.
    • 5 scheduler mocks loosened to mock.Anything for the now-wrapped
      ctx (UpsertRecommendations, SendNewRecommendationsNotification).
    • Azure's GetRecommendations extracts a goService closure to
      keep the function under the project's 10-branch gocyclo gate
      after the per-service Acquire branches were added.
  • infra(aws/dev) — bump dev Lambda memory 512 MB → 2 GB in
    terraform/environments/aws/github-dev.tfvars. Temporary headroom
    scoped to dev — the semaphore is the durable fix, this just gives
    the function room to breathe while we observe the new behaviour
    live. Staging/prod untouched.

Test plan

  • go build ./... from repo root — clean
  • go test ./internal/scheduler/... — pass (74 tests, including
    the 3 added in #269, "perf: parallelize recommendations collection
    at every level", which closes #266/#267/#268)
  • go test ./recommendations/... from providers/aws/ — 176 pass
  • go test . from providers/azure/ — 104 pass
  • go test . from providers/gcp/ (recs subset) — 18 pass
  • go test ./concurrency/... from pkg/ — 11 pass
  • Post-merge: redeploy + watch CloudWatch for the
    "Recommendations collection: aggregate parallelism cap = 20"
    log line on entry, and confirm Max Memory Used < 2048 MB and
    Duration < 60s on a full refresh.

🤖 Generated with claude-flow

Summary by CodeRabbit

  • New Features

    • Configurable, process-wide concurrency limiting for cloud API calls across AWS, Azure, and GCP to improve stability and prevent resource exhaustion during batch recommendation processing.
  • Chores

    • Increased AWS Lambda memory allocation from 512 MB to 2048 MB to enhance processing capacity and throughput.

cristim added 2 commits May 5, 2026 00:08
…hore

The recommendations-collection fan-out has up to four nested levels:
provider → account → service|region → per-region service. Each level was
independently capped, so peak goroutine counts multiplied through the
tree (3 providers × 20 accounts × 30 GCP regions × 2 services = 3 600
in-flight gRPC/HTTP clients on a real multi-account deploy). On a 512 MB
Lambda this exhausted memory before the work could finish — observed in
dev as `signal: killed` at Max Memory Used = 512 MB after ~28s, with the
runtime never reaching the 60s timeout.

A single semaphore stashed on ctx now lets every leaf goroutine (the
goroutine that issues the actual cloud-API call) Acquire one slot before
doing IO and Release after, so aggregate concurrent IO is hard-bounded
by CUDLY_MAX_PARALLELISM (default 20) regardless of how nested the
dispatch is. Intermediate dispatchers (provider, account, GCP region)
do NOT acquire — they only launch sub-goroutines — so no goroutine can
deadlock by holding a permit while waiting for sub-permits.

Mechanics:

- New pkg/concurrency package: WithSharedSemaphore / SharedSemaphore /
  Acquire / Release helpers that read the semaphore from ctx, plus
  MaxParallelismFromEnv reading CUDLY_MAX_PARALLELISM (default 20).
  Acquire/Release are no-ops when no semaphore is on ctx — CLI tools
  and unit tests skip the semaphore entirely without per-call branching.

- Scheduler.CollectRecommendations allocates the semaphore (size from
  env) and attaches to ctx via concurrency.WithSharedSemaphore BEFORE
  the provider fan-out, so the wrapped ctx flows through every nested
  level. Logs the effective cap on entry.

- Three leaf fan-outs wired:
    - providers/aws/recommendations/client.go::GetAllRecommendations —
      5 service goroutines (EC2 / RDS / ElastiCache / OpenSearch /
      Redshift)
    - providers/azure/recommendations.go::GetRecommendations — 5
      service goroutines (compute / database / cache / cosmosdb /
      advisor); Acquire/Release boilerplate extracted into a goService
      helper so the function stays under gocyclo's 10-branch gate.
    - providers/gcp/recommendations.go::collectRegion — 2 sub-fan-out
      goroutines (compute / cloudsql) per region

  Each leaf calls Acquire(gctx) at the top, defers Release(gctx). On
  Acquire failure (parent ctx cancelled mid-wait) the leaf's per-service
  err variable is set to ctx.Err() and the goroutine returns nil to the
  errgroup — preserves the documented error-isolation contract.

Test mocks updated: persistCollection's UpsertRecommendations and the
notification email's SendNewRecommendationsNotification both run inside
CollectRecommendations after the ctx is wrapped, so their mock setups
were pinning the original (unwrapped) ctx. mock.Anything is the
appropriate looseness — the wrapped ctx is implementation detail.

New pkg/concurrency tests pin:
  - MaxParallelismFromEnv env-knob parser (unset / positive / zero /
    negative / non-numeric / explicit-unset)
  - Acquire/Release are no-ops when no semaphore on ctx
  - WithSharedSemaphore returns ctx unchanged when sem is nil
  - 20 goroutines under cap=3 never see >3 in-flight (load-bearing)
  - Acquire returns ctx.Err() when cancelled mid-wait

This is a smoke-test config — the user's dev deploy hit OOM with the
existing per-level caps; CUDLY_MAX_PARALLELISM=20 (paired with a Lambda
memory bump to 2 GB in a follow-up commit) should give the dev refresh
headroom to complete. Operators can dial up further via env override.
The dev Lambda hit Max Memory Used = 512 MB and was killed by the
runtime at ~28s (Duration 28701ms / Timeout 60s) on a user-triggered
recommendations refresh. The collection fan-out (post-#266 #267 #268)
keeps many concurrent gRPC/HTTP clients in flight — 30+ GCP regions ×
2 services each, plus 5 AWS services and 5 Azure services — and 512 MB
is too tight for that working-set even with the new shared-semaphore
cap.

This is a temporary headroom bump scoped to dev so the user can verify
the post-#266/#267/#268/concurrency-semaphore stack actually completes
end-to-end on a deployed Lambda. The shared semaphore (preceding
commit) is the durable fix for peak concurrency; this just gives the
function room to breathe while we observe the new behaviour live.

Only github-dev.tfvars is touched. github-staging.tfvars and
github-prod.tfvars stay at 1024 MB — they don't exercise every
provider with mostly-broken credentials the way dev does, and bumping
prod without first observing dev would be premature.

dev.tfvars (the user's local-apply config, gitignored) is also bumped
locally to keep parity with the CI deploy until this lands.
coderabbitai (Bot) commented May 4, 2026

📝 Walkthrough

A new package provides a process-wide, environment-configurable weighted semaphore. The scheduler installs a shared semaphore on the context, and AWS/Azure/GCP provider code acquires/releases permits around their leaf cloud-API calls to cap aggregate parallelism while preserving per-service error isolation.

Changes

Concurrency Control Integration

  • Data Shape / Config (pkg/concurrency/concurrency.go): new package with DefaultMaxParallelism, MaxParallelismFromEnv() reading CUDLY_MAX_PARALLELISM with validation, and context-key wiring WithSharedSemaphore / SharedSemaphore.
  • Core Primitives (pkg/concurrency/concurrency.go): Acquire(ctx) blocks to obtain one permit from a context-attached *semaphore.Weighted and returns ctx.Err() on cancellation; Release(ctx) releases one permit; both are no-ops if no semaphore is attached.
  • Scheduler Wiring (internal/scheduler/scheduler.go): CollectRecommendations() creates a weighted semaphore sized by MaxParallelismFromEnv(), attaches it to the shared ctx via WithSharedSemaphore(), and logs the cap.
  • Provider Integration, leaf calls (providers/aws/recommendations/client.go, providers/aws/recommendations/parser_sp.go, providers/azure/recommendations.go, providers/gcp/recommendations.go): per-provider leaf SDK calls are wrapped with concurrency.Acquire(ctx) / defer concurrency.Release(ctx) (or immediate Release); acquisition failures are recorded into per-service error variables and goroutines return nil to preserve sibling isolation.
  • Tests / Mocks (pkg/concurrency/concurrency_test.go, internal/scheduler/scheduler_test.go): new tests for env parsing, no-op behavior, bounding, and cancellation of the concurrency primitives; scheduler tests accept wrapped contexts (mock.Anything) for persistence/email calls.
  • Module / Infra (pkg/go.mod, providers/gcp/go.mod, terraform/environments/aws/github-dev.tfvars): Go version bumped to 1.25.0, golang.org/x/sync added/updated, Lambda memory increased to 2048 MB.

Sequence Diagram

sequenceDiagram
    participant Scheduler
    participant Concurrency as Concurrency Package
    participant Context as Context
    participant Provider as Provider (AWS/Azure/GCP)
    participant CloudAPI as Cloud API

    Scheduler->>Concurrency: MaxParallelismFromEnv()
    activate Concurrency
    Concurrency-->>Scheduler: numeric cap
    deactivate Concurrency

    Scheduler->>Concurrency: WithSharedSemaphore(ctx, sem)
    activate Concurrency
    Concurrency->>Context: attach semaphore
    Concurrency-->>Scheduler: wrapped ctx
    deactivate Concurrency

    Scheduler->>Provider: CollectRecommendations(wrapped ctx)

    rect rgba(100, 150, 200, 0.5)
        note over Provider: service/region fan-out
        par Service A
            Provider->>Concurrency: Acquire(wrapped ctx)
            activate Concurrency
            Concurrency-->>Provider: permit acquired (or wait)
            deactivate Concurrency
            Provider->>CloudAPI: Leaf API call
            CloudAPI-->>Provider: response/error
            Provider->>Concurrency: Release(wrapped ctx)
        and Service B
            Provider->>Concurrency: Acquire(wrapped ctx)
            activate Concurrency
            Concurrency-->>Provider: permit acquired (or wait)
            deactivate Concurrency
            Provider->>CloudAPI: Leaf API call
            CloudAPI-->>Provider: response/error
            Provider->>Concurrency: Release(wrapped ctx)
        end
    end

    Provider-->>Scheduler: merged results/errors

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • LeanerCloud/CUDly#195: Modifies AWS recommendations fan-out; related to how Cost Explorer requests are issued.
  • LeanerCloud/CUDly#259: Also changes Azure per-service concurrent calls; related integration pattern.
  • LeanerCloud/CUDly#269: Parallelizes provider fan-out via errgroup; this PR complements it by capping total concurrency via a shared semaphore.

Suggested labels

type/feat

Poem

🐇 In code I hop, counting slots with care,
A semaphore nest keeps the cloud calls fair.
No stampede of requests, just orderly rhyme —
One permit, one hop, concurrency in time. 🥕

🚥 Pre-merge checks: 5 passed
  • Description Check: passed (skipped; CodeRabbit's high-level summary is enabled).
  • Title Check: passed. The title directly and specifically summarizes the main changes, introducing an aggregate parallelism cap and increasing Lambda memory, which are the core objectives of the PR.
  • Docstring Coverage: passed. Coverage is 100.00%; the required threshold is 80.00%.
  • Linked Issues Check: passed (skipped; no linked issues were found for this pull request).
  • Out of Scope Changes Check: passed (skipped; no linked issues were found for this pull request).


cristim added labels priority/p2 (Backlog-worthy), severity/medium (Moderate harm), urgency/this-sprint (Within the current sprint), impact/many (Affects most users), effort/m (Days), type/bug (Defect), and triaged (Item has been triaged) on May 4, 2026
cristim (Member, Author) commented May 4, 2026

@coderabbitai review

coderabbitai (Bot) commented May 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai (Bot) left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/concurrency/concurrency_test.go`:
- Around line 82-95: The worker goroutines call require.NoError(t, Acquire(ctx))
which is illegal from non-test goroutines; instead, have each goroutine capture
the error and report it back to the main test goroutine (e.g., send errors on an
err channel or append to a slice with a mutex) and only call require.NoError
from the main goroutine after wg.Wait(); also ensure Release(ctx) is only
deferred when Acquire returned nil (successful) so the goroutine sends the
Acquire error (or nil) back, main goroutine closes the channel after wg.Wait()
and asserts NoError for each reported error. Reference Acquire, Release,
inflight, updatePeak, goroutines, and wg to locate the logic to change.

In `@providers/aws/recommendations/client.go`:
- Around line 196-244: The current goroutines (the g.Go blocks) acquire a global
semaphore before calling GetRecommendationsForService, which holds a slot for
the entire service sweep; move the semaphore boundary into the Cost Explorer
call itself so each outbound CE request acquires/releases a permit. Remove
concurrency.Acquire/Release from the outer g.Go closures and instead update
GetRecommendationsForService (or the helper it calls) to call
concurrency.Acquire(ctx) immediately before each individual Cost Explorer API
invocation and defer concurrency.Release(ctx) right after that request/response
(including retries/backoffs around only the API call). This ensures the
semaphore limits concurrent CE calls, not per-service sweeps.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1a60a4b4-4a48-44a2-8222-e2c11a8bf20c

📥 Commits

Reviewing files that changed from the base of the PR and between defac6c and c9c314c.

⛔ Files ignored due to path filters (2)
  • pkg/go.sum is excluded by !**/*.sum
  • providers/gcp/go.sum is excluded by !**/*.sum
📒 Files selected for processing (10)
  • internal/scheduler/scheduler.go
  • internal/scheduler/scheduler_test.go
  • pkg/concurrency/concurrency.go
  • pkg/concurrency/concurrency_test.go
  • pkg/go.mod
  • providers/aws/recommendations/client.go
  • providers/azure/recommendations.go
  • providers/gcp/go.mod
  • providers/gcp/recommendations.go
  • terraform/environments/aws/github-dev.tfvars

Two CodeRabbit findings on PR #270:

1. **`pkg/concurrency/concurrency_test.go` (CRITICAL)**:
   `require.NoError` was called from inside the worker goroutines in
   `TestSharedSemaphore_BoundsConcurrency`. Testify's documented contract
   is that `require.*` / `FailNow`-style helpers must run on the test's
   own goroutine — calling them from a worker uses `runtime.Goexit` on
   the worker instead of safely stopping the test, which can hang or
   skip cleanup and produces race-detector noise.

   Workers now capture their Acquire result on a buffered `errCh` and
   the main goroutine drains the channel and asserts after `wg.Wait()`.
   `Release` is deferred only on a successful Acquire, matching the
   documented pairing contract.

2. **`providers/aws/recommendations/client.go` (MAJOR)**:
   The per-service `g.Go` goroutines in `GetAllRecommendations` were
   calling `concurrency.Acquire`/`Release` around the whole
   `GetRecommendationsForService` sweep — but that helper internally
   loops 2 × 3 (term, payment) variants and waits on the rate-limiter
   inside each retry, so a throttled service would hold a global slot
   for seconds while no Cost Explorer request was actually in flight.
   Effective parallelism dropped well below `CUDLY_MAX_PARALLELISM`.

   The Acquire/Release boundary moves down to each individual
   `costExplorerClient.GetReservationPurchaseRecommendation` call (and
   the SP equivalent in `parser_sp.go`). The rate-limiter's
   `Wait` happens *outside* the permit, so a goroutine waiting on
   exponential backoff doesn't tie up a slot. Each retry
   acquires a fresh permit, so throughput now scales with the cap
   regardless of how many services are individually throttled.

Tests:
- `go test ./recommendations/...` from `providers/aws/` — pass
- `go test -race ./concurrency/...` from `pkg/` — race-clean
- Existing `go test -race ./recommendations/...` reveals a pre-existing
  data race on `Client.rateLimiter` (concurrent `Reset` / state
  mutation across the per-service goroutines), introduced by the
  parallelisation in #269 (`defac6c2a`). Filed as #271 — out of scope
  for this CR-fix commit, fix is straightforward (per-goroutine rate-
  limiter or internal mutex).

Other CodeRabbit checks (Azure / GCP per-service goroutines holding
permits across SDK calls) are NOT applied here because:
- Azure's per-service SDK calls don't have an explicit rate-limiter
  Wait loop equivalent to AWS's — the inner ARM/pricing calls block
  on network IO, not exponential backoff.
- GCP's `collectRegion` sub-fan-out goroutines do similarly direct SDK
  calls without an explicit Wait loop.
- CR didn't flag those layers; if a future review surfaces the same
  argument with concrete throughput evidence, we can mirror the AWS
  pattern there.
cristim (Member, Author) commented May 4, 2026

Addressed both CR findings in 092e0b3:

1. pkg/concurrency/concurrency_test.go:95 (Critical — require.* from worker goroutine) — Fixed exactly as suggested. Workers now capture the Acquire result on a buffered errCh; main goroutine drains and asserts after wg.Wait(). Release is deferred only on successful Acquire. Verified with go test -race ./concurrency/... — race-clean.

2. providers/aws/recommendations/client.go:244 (Major — per-sweep permit holds slot during rate-limiter Wait) — Boundary moved down. Acquire/Release now wraps just the costExplorerClient.GetReservationPurchaseRecommendation call (and the SP equivalent in parser_sp.go); the rate-limiter Wait runs outside the permit, so backoff waits free the slot for siblings. Each retry re-acquires. Per-service g.Go bodies in GetAllRecommendations no longer Acquire — they only dispatch.

Out-of-scope finding surfaced during validation: go test -race ./recommendations/... reveals a pre-existing data race on Client.rateLimiter (concurrent Reset + state mutation across the 5 per-service goroutines). It was introduced by #269 (defac6c2a), not by either of the changes on this PR. Filed as #271 with a tight fix proposal (per-goroutine rate-limiter or internal mutex) — keeping this PR scoped to the CR feedback.

Azure/GCP not changed — CR's specific argument (permit held across rate-limiter Wait loop) is structural to AWS's Cost Explorer client, which has an explicit c.rateLimiter.Wait(ctx) inside its retry loop. Azure/GCP per-service goroutines block on network IO inside their SDK calls, not on a Wait-loop, so the throughput issue doesn't apply identically. Happy to mirror the pattern there in a follow-up if a future review surfaces concrete evidence.

@coderabbitai review

coderabbitai (Bot) commented May 4, 2026

Tip

For best results, initiate chat on the files or code changes.

@cristim: Thanks for the detailed write-up. Both fixes look well-reasoned:

  1. errCh pattern — moving assertions out of worker goroutines is the correct approach; the buffered channel + drain-after-wg.Wait() pattern is idiomatic and race-safe. Good call deferring Release only on successful Acquire.

  2. Narrowed Acquire/Release boundary — wrapping only the actual CE API call (and SP equivalent in parser_sp.go) rather than the entire retry sweep is the right call. Freeing the slot during rate-limiter backoff maximises sibling throughput under the cap, and re-acquiring per retry is the correct tradeoff.

The Azure/GCP non-change rationale is sound — the concern is specific to the explicit Wait loop in the CE retry path; SDK-internal blocking IO doesn't have the same slot-hoarding characteristic. A follow-up to mirror the tighter boundary there if profiling ever surfaces it makes sense.

Good catch on the Client.rateLimiter race and the clean separation into #271 — keeping this PR scoped is the right call.

Triggering a fresh review of the updated commit.


cristim (Member, Author) commented May 4, 2026

@coderabbitai review

coderabbitai (Bot) commented May 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai (Bot) left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
providers/aws/recommendations/client.go (1)

74-100: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Make the retry limiter request-scoped, not client-scoped.

GetAllRecommendations fans out multiple service sweeps concurrently, but this path still mutates the shared c.rateLimiter via Reset/Wait/ShouldRetry/GetRetryCount. That lets concurrent requests clobber each other’s retry/backoff state and matches the race already seen under go test -race. Please instantiate a fresh limiter per recommendation sweep here, and mirror that change in the Savings Plans path too.

Suggested direction
-	c.rateLimiter.Reset()
+	rateLimiter := NewRateLimiter()
+	rateLimiter.Reset()
 	for {
-		if waitErr := c.rateLimiter.Wait(ctx); waitErr != nil {
+		if waitErr := rateLimiter.Wait(ctx); waitErr != nil {
 			return nil, fmt.Errorf("rate limiter wait failed: %w", waitErr)
 		}

 		if acqErr := concurrency.Acquire(ctx); acqErr != nil {
 			return nil, fmt.Errorf("concurrency acquire failed: %w", acqErr)
 		}
 		result, err = c.costExplorerClient.GetReservationPurchaseRecommendation(ctx, input)
 		concurrency.Release(ctx)
-		if !c.rateLimiter.ShouldRetry(err) {
+		if !rateLimiter.ShouldRetry(err) {
 			break
 		}
 	}

 	if err != nil {
-		return nil, fmt.Errorf("failed to get RI recommendations after %d retries: %w", c.rateLimiter.GetRetryCount(), err)
+		return nil, fmt.Errorf("failed to get RI recommendations after %d retries: %w", rateLimiter.GetRetryCount(), err)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@providers/aws/recommendations/client.go` around lines 74 - 100, The shared
c.rateLimiter is being mutated across concurrent sweeps; make the limiter
request-scoped by creating a fresh local limiter instance at the start of the
recommendation loop and use that instead of c.rateLimiter. Concretely, before
the for-loop in GetReservationPurchaseRecommendation create a copy/localLimiter
(e.g., via a Clone/Copy/New method or by constructing a new limiter with the
same config), call
localLimiter.Reset()/localLimiter.Wait()/localLimiter.ShouldRetry()/localLimiter.GetRetryCount()
inside the loop, and remove mutations of c.rateLimiter; then apply the same
change to the Savings Plans path (GetSavingsPlanPurchaseRecommendation) and
ensure callers in GetAllRecommendations continue to fan out safely using
per-request limiters.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f79e1acf-ebc6-4472-adec-8a90d306780f

📥 Commits

Reviewing files that changed from the base of the PR and between c9c314c and 092e0b3.

📒 Files selected for processing (3)
  • pkg/concurrency/concurrency_test.go
  • providers/aws/recommendations/client.go
  • providers/aws/recommendations/parser_sp.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/concurrency/concurrency_test.go

@cristim cristim merged commit e010987 into feat/multicloud-web-frontend May 4, 2026
4 checks passed
@cristim cristim deleted the perf/recommendations-aggregate-cap branch May 4, 2026 23:00
