
perf: parallelize recommendations collection at every level (closes #266 #267 #268)#269

Merged
cristim merged 3 commits into feat/multicloud-web-frontend from perf/parallelize-recommendations-collection on May 4, 2026

Conversation

@cristim
Member

@cristim cristim commented May 4, 2026

Summary

Mirrors the Azure parallelisation (#258 / b10326c5) onto every other
level of the recommendations-collection tree. After this PR, every
hop of the collect path runs concurrently — total wall time drops from
sum(per-provider × per-account × per-service-or-region) to max(...)
at every level.

Three sibling perf commits, one per closed issue:

  • perf(aws) (closes #266): GetAllRecommendations's 5 sequential
    service calls (EC2 / RDS / ElastiCache / OpenSearch / Redshift) now
    run concurrently under errgroup. Drops the 100ms artificial stagger
    (a sequential-mode rate-limit hack); the post-Wait ctx.Err() block
    matches the Azure shape. New
    TestGetAllRecommendations_PropagatesContextCancellation pins the
    contract.
  • perf(gcp) (closes #267): GetRecommendations's nested
    for region { for service { ... } } (~30+ regions × 2 services =
    60+ sequential RPCs) is replaced by a two-level fan-out: outer
    errgroup over regions (capped at CUDLY_GCP_REGION_PARALLELISM,
    default 10) + inner sub-errgroup over (compute, sql) per region.
    New TestRecommendationsClientAdapter_GetRecommendations_PropagatesContextCancellation
    and TestGCPRegionConcurrency pin the contract and the env-knob parser.
  • perf(scheduler) (closes #268): CollectRecommendations's
    provider loop now fans out AWS/Azure/GCP under errgroup. A
    deterministic merge walks EnabledProviders in config order so
    successfulProviders ordering is independent of goroutine completion
    order. New TestScheduler_CollectRecommendations_ParallelProviders
    pins the config-order + ctx-cancellation contracts.

Behaviour change

Per-service / per-(region, service) errors that previously were silently
swallowed by continue (AWS) / if err == nil { ... } (GCP) are now
logged at WARN. Errors still don't propagate to callers — only the log
severity changes from "invisible" to WARN, matching the observability
the Azure parallelisation added.

Wall-time impact (cumulative with #258 + #260)

  • AWS per account: sum(5 per-service) + 400ms stagger → max(5)
  • GCP per account: N_regions × (compute + sql) →
    ceil(N_regions / cap) × max(compute, sql) (~6× speedup at 30
    regions / cap 10, before counting per-region sub-fan-out)
  • Scheduler total: sum(per-provider) → max(per-provider)
    (~30-45s → ~max(15-20s) with all three providers enabled)

User-facing 502 was already gone via #260's async self-invoke; this
PR shrinks the in-flight freshness-banner window and gives cron runs
more headroom before the next tick.

Test plan

  • go build ./... from repo root — clean
  • go test ./internal/scheduler/... — 74 pass (was 71; +3 new)
  • go test ./recommendations/... from providers/aws/ — 176 pass
  • go test . from providers/gcp/ (top-level package) — 18 pass
    (recommendations + concurrency + shouldIncludeService)
  • Pre-commit hooks pass (gocyclo, secrets scan, gosec, trivy, Go
    tests, golangci-lint)
  • Post-merge: monitor scheduler Lambda CloudWatch logs for the
    freshness-banner in-flight window — expected to shrink to
    max(per-provider).

🤖 Generated with claude-flow

Summary by CodeRabbit

  • New Features

    • Recommendations are now collected concurrently across providers and services, significantly improving collection performance.
  • Bug Fixes

    • Provider and service failures are now isolated; one failure no longer blocks other providers from completing their collections.
    • Context cancellation is properly propagated after concurrent collection completes.
  • Chores

    • Updated dependencies to support improved concurrency handling.

cristim added 3 commits May 4, 2026 21:30
…tions (closes #266)

The dispatcher in providers/aws/recommendations/client.go::GetAllRecommendations
ran 5 service calls (EC2 / RDS / ElastiCache / OpenSearch / Redshift) one
after another with an artificial 100ms stagger between each. Each per-
service call hits AWS Cost Explorer's GetReservationPurchaseRecommendation,
which routinely takes multiple seconds. Total wall time per account scaled
as sum(per-service) instead of max(per-service).

Mirrors the Azure parallelisation in providers/azure/recommendations.go
(closes #258, commit b10326c). The five calls have no inter-service
dependency, so they can run concurrently under errgroup.WithContext. Each
goroutine captures its own error in a closure-scoped variable and returns
nil to the group so one service's failure does not cancel siblings
(matching the previous loop's `continue`-on-error tolerance).

Results are merged in the canonical order
EC2 → RDS → ElastiCache → OpenSearch → Redshift after Wait() so any
order-sensitive callers stay stable. After Wait(), ctx.Err() is checked
and propagated so a canceled parent ctx surfaces as context.Canceled /
context.DeadlineExceeded rather than being swallowed by the per-service
error-isolation goroutines (which all return nil).
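A minimal sketch of that shape follows; serviceResult, the fetcher signature
and the Recommendation type here are illustrative stand-ins, not the exact
symbols in client.go.

```go
package recommendations

import (
	"context"
	"log/slog"

	"golang.org/x/sync/errgroup"
)

type Recommendation struct{ Service string }

type serviceResult struct {
	service string
	recs    []Recommendation
	err     error
}

func getAllRecommendations(ctx context.Context,
	fetch func(context.Context, string) ([]Recommendation, error)) ([]Recommendation, error) {

	services := []string{"EC2", "RDS", "ElastiCache", "OpenSearch", "Redshift"}
	results := make([]serviceResult, len(services))

	g, gctx := errgroup.WithContext(ctx)
	for i, svc := range services {
		g.Go(func() error {
			recs, err := fetch(gctx, svc)
			// Record the outcome instead of returning err: one failing
			// service must not cancel its siblings via the errgroup.
			results[i] = serviceResult{service: svc, recs: recs, err: err}
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	// Every goroutine returned nil, so a cancelled parent ctx has to be
	// surfaced explicitly after Wait().
	if err := ctx.Err(); err != nil {
		return nil, err
	}

	// Deterministic merge in the canonical service order; per-service
	// failures only downgrade to a WARN log.
	var merged []Recommendation
	for _, r := range results {
		if r.err != nil {
			slog.Warn("service recommendations failed", "service", r.service, "error", r.err)
			continue
		}
		merged = append(merged, r.recs...)
	}
	return merged, nil
}
```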

The 100ms artificial stagger is removed — it was a sequential-mode rate-
limit hack that added 400ms per account for no benefit under concurrent
fan-out (Cost Explorer rate limits apply per request, not per consecutive
request).

Behaviour change vs the previous sequential loop: per-service errors are
now logged at WARN via the new mergeServiceResults helper. The previous
loop swallowed them silently with a bare `continue`, leaving operators no
signal when a single service was misbehaving. The serviceResult struct +
mergeServiceResults helper extraction also keeps GetAllRecommendations
under the project's gocyclo gate (.golangci.yml min-complexity: 15).

New test TestGetAllRecommendations_PropagatesContextCancellation pins the
contract: calling GetAllRecommendations with a pre-cancelled context must
return context.Canceled. Mirrors the Azure equivalent at
providers/azure/recommendations_test.go::TestRecommendationsClientAdapter_GetRecommendations_PropagatesContextCancellation.
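Written against the illustrative getAllRecommendations sketch above (the real
test targets the production client), the pinned contract looks roughly like
this:

```go
package recommendations

import (
	"context"
	"testing"

	"github.com/stretchr/testify/require"
)

func TestGetAllRecommendations_PropagatesContextCancellation(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // pre-cancelled before the call

	fetch := func(ctx context.Context, service string) ([]Recommendation, error) {
		return nil, ctx.Err() // per-service calls see the already-cancelled ctx
	}

	recs, err := getAllRecommendations(ctx, fetch)
	require.ErrorIs(t, err, context.Canceled)
	require.Nil(t, recs)
}
```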

Expected end-to-end after this change: ~max(individual call durations)
instead of sum + 4×100ms stagger.
…tions (closes #267)

The dispatcher in providers/gcp/recommendations.go::GetRecommendations was
doubly serial — `for region { for service { ... } }` — fetching Compute
Engine then Cloud SQL recommendations one region at a time. With ~30+
GCP regions × 2 services, that's 60+ sequential RPCs per account against
the project-scoped Recommender API. Even at 200ms per call this is a
12s+ floor; in practice most calls are slower. After #260 (async self-
invoke) this no longer triggers user-facing 502s, but it directly
inflates the scheduler Lambda's runtime and the in-flight freshness-
banner window.

Two-level concurrent fan-out, mirroring the Azure pattern in
providers/azure/recommendations.go (closes #258, commit b10326c) and
the AWS service-loop parallelisation (closes #266):

  - Outer: errgroup over regions, capped at gcpRegionConcurrency()
    (CUDLY_GCP_REGION_PARALLELISM, default 10) to stay polite to the
    project-scoped Recommender API quota. Lower than the existing
    account-level CUDLY_MAX_ACCOUNT_PARALLELISM (default 20) because
    each account already gets its own goroutine slot from
    fanOutPerAccount; cumulative concurrency would multiply otherwise.
  - Inner: within each region's goroutine, Compute + Cloud SQL run as
    two further goroutines under a per-region sub-errgroup, so per-
    region cost is max(compute, sql) rather than compute + sql.

Each goroutine returns nil to its errgroup so a single per-(region,
service) failure does not cancel siblings — preserves the previous
silent-skip-on-err shape semantically.

Behaviour change vs the previous nested for-loops: per-(region,
service) errors that were silently swallowed by the previous
`if err == nil { ... }` shape are now logged at WARN with the
region+service tag so misconfigured projects are diagnosable. Errors
still don't propagate to callers (preserving silent-skip), only the
log severity changes from "invisible" to WARN.

After the outer Wait(), ctx.Err() is checked and propagated so a
canceled parent ctx surfaces as context.Canceled /
context.DeadlineExceeded rather than being swallowed by the per-region
error-isolation goroutines (which all return nil).

Result merging walks regions in sorted order and appends compute then
sql per region — output is deterministic independent of GCP API
region-list ordering or goroutine completion order.
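A hedged sketch of the two-level fan-out and sorted merge; the fetcher
signatures, the Recommendation type and gcpRegionConcurrency here are
simplified stand-ins for the real code in providers/gcp/recommendations.go.

```go
package gcp

import (
	"context"
	"log/slog"
	"sort"
	"sync"

	"golang.org/x/sync/errgroup"
)

type Recommendation struct{ Region, Service string }

func getRecommendations(ctx context.Context, regions []string,
	fetchCompute, fetchSQL func(context.Context, string) ([]Recommendation, error)) ([]Recommendation, error) {

	var (
		mu       sync.Mutex
		byRegion = make(map[string][]Recommendation)
	)

	// Outer level: one goroutine per region, capped by the env knob.
	g, gctx := errgroup.WithContext(ctx)
	g.SetLimit(gcpRegionConcurrency())
	for _, region := range regions {
		g.Go(func() error {
			recs := collectRegion(gctx, region, fetchCompute, fetchSQL)
			mu.Lock()
			byRegion[region] = recs
			mu.Unlock()
			return nil // per-region failures never cancel sibling regions
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	if err := ctx.Err(); err != nil { // surface parent cancellation explicitly
		return nil, err
	}

	// Deterministic merge: sorted region order, compute before sql per region.
	keys := make([]string, 0, len(byRegion))
	for r := range byRegion {
		keys = append(keys, r)
	}
	sort.Strings(keys)
	var all []Recommendation
	for _, r := range keys {
		all = append(all, byRegion[r]...)
	}
	return all, nil
}

// Inner level: compute and Cloud SQL fetched concurrently within a region.
func collectRegion(ctx context.Context, region string,
	fetchCompute, fetchSQL func(context.Context, string) ([]Recommendation, error)) []Recommendation {

	var (
		compute, sql       []Recommendation
		computeErr, sqlErr error
	)
	sub, subCtx := errgroup.WithContext(ctx)
	sub.Go(func() error { compute, computeErr = fetchCompute(subCtx, region); return nil })
	sub.Go(func() error { sql, sqlErr = fetchSQL(subCtx, region); return nil })
	_ = sub.Wait() // both goroutines return nil; errors are kept per service

	for svc, err := range map[string]error{"compute": computeErr, "cloudsql": sqlErr} {
		if err != nil {
			slog.Warn("GCP recommendations failed", "region", region, "service", svc, "error", err)
		}
	}
	return append(compute, sql...)
}
```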

The collectRegion helper extraction also keeps GetRecommendations
under the project's gocyclo gate (.golangci.yml min-complexity: 15)
after the post-Wait ctx.Err() block was added.

Tests added:
  - TestRecommendationsClientAdapter_GetRecommendations_PropagatesContextCancellation
    pins the contract that a pre-cancelled ctx surfaces as context.Canceled.
  - TestGCPRegionConcurrency pins the env-knob parser semantics
    (unset/positive/non-numeric/zero/negative/explicit-unset).
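For reference, a plausible shape of the gcpRegionConcurrency parser those
semantics imply (the real implementation may differ in detail):

```go
package gcp

import (
	"os"
	"strconv"
)

const defaultGCPRegionConcurrency = 10

// gcpRegionConcurrency reads CUDLY_GCP_REGION_PARALLELISM and falls back to
// the default when the variable is unset, non-numeric, zero or negative.
func gcpRegionConcurrency() int {
	raw := os.Getenv("CUDLY_GCP_REGION_PARALLELISM")
	if raw == "" {
		return defaultGCPRegionConcurrency
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return defaultGCPRegionConcurrency
	}
	return n
}
```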

Expected end-to-end after this change:
ceil(N_regions / cap) × max(compute, sql) instead of
N_regions × (compute + sql) — for a project with 30 regions and a cap
of 10, that's a ~6× speedup before counting the per-region service-
level parallelism win.
…loses #268)

The provider loop in scheduler.go::CollectRecommendations ran AWS, Azure
and GCP collection sequentially: total wall time = sum(per-provider)
rather than max(per-provider). After #260 (async self-invoke) the user-
facing 502 on POST /api/recommendations/refresh is gone, but the
scheduler Lambda itself still pays the serial cost — visible as a
longer "collection in flight" freshness-banner window and reduced
cron-tick headroom.

Provider-level fan-out under errgroup, mirroring the Azure and AWS
service-loop parallelisations (#258 / #266). Each per-provider goroutine
returns nil to the group so a single provider's error does not cancel
its siblings — preserves the previous loop's `continue`-on-error
semantics. Per-provider results are written into a map under a single
mutex (the same pattern as fanOutPerAccount at scheduler.go:393).

After Wait, ctx.Err() is checked and propagated so a canceled parent
ctx surfaces as context.Canceled / context.DeadlineExceeded rather than
being swallowed by the per-provider error-isolation goroutines (which
all return nil).

Deterministic merge: walk EnabledProviders in config order to assemble
successfulProviders / successfulCollects / allRecommendations /
totalSavings / failedProviders. Order is by config, NOT by goroutine
completion order or by Go's randomised map iteration — keeps existing
tests stable and the freshness-banner / notification-email content
predictable.
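A minimal sketch of that fan-out plus config-order merge; the types and the
collect callback are simplified stand-ins for the real collectAllProviders in
scheduler.go.

```go
package scheduler

import (
	"context"
	"sync"

	"golang.org/x/sync/errgroup"
)

type Recommendation struct{ Provider string }

type providerOutcome struct {
	recs []Recommendation
	err  error
}

type CollectResult struct {
	SuccessfulProviders []string
	FailedProviders     map[string]error
	Recommendations     []Recommendation
}

func collectAllProviders(ctx context.Context, enabledProviders []string,
	collect func(context.Context, string) ([]Recommendation, error)) (*CollectResult, error) {

	var (
		mu       sync.Mutex
		outcomes = make(map[string]providerOutcome)
	)

	g, gctx := errgroup.WithContext(ctx)
	for _, provider := range enabledProviders {
		g.Go(func() error {
			recs, err := collect(gctx, provider)
			mu.Lock()
			outcomes[provider] = providerOutcome{recs: recs, err: err}
			mu.Unlock()
			return nil // a provider failure never cancels its siblings
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	if err := ctx.Err(); err != nil { // surface parent cancellation explicitly
		return nil, err
	}

	// Merge in EnabledProviders config order, not goroutine completion order
	// or map iteration order, so downstream ordering stays deterministic.
	res := &CollectResult{FailedProviders: map[string]error{}}
	for _, provider := range enabledProviders {
		o := outcomes[provider]
		if o.err != nil {
			res.FailedProviders[provider] = o.err
			continue
		}
		res.SuccessfulProviders = append(res.SuccessfulProviders, provider)
		res.Recommendations = append(res.Recommendations, o.recs...)
	}
	return res, nil
}
```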

The new collectAllProviders helper extraction also keeps
CollectRecommendations under the project's gocyclo gate
(.golangci.yml min-complexity: 15) after the errgroup + post-Wait
ctx.Err() block was added.

No concurrency cap on the outer fan-out — the universe is at most 3
providers; the per-account fan-out inside each provider is still
bounded by CUDLY_MAX_ACCOUNT_PARALLELISM (default 20) and per-region
GCP fan-out by CUDLY_GCP_REGION_PARALLELISM (default 10, from #267).

Test mocks updated: per-provider mock methods (CreateAndValidateProvider,
ListCloudAccounts, GetRecommendationsClient, GetAllRecommendations) are
now invoked from inside the errgroup goroutine, so they receive the
errgroup-derived gctx rather than the caller's ctx. Mock setups changed
from literal ctx to mock.Anything where the call sits inside the
goroutine path; calls made before/after the fan-out (GetGlobalConfig,
SendNewRecommendationsNotification, persistCollection writes) still
pin literal ctx because they run on the caller's context unchanged.
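Illustrative testify setups for the two cases (the mock object names and
argument values are hypothetical, not the repo's actual mocks):

```go
// Calls that now run inside an errgroup goroutine receive the
// errgroup-derived gctx, so the expectation can't pin the caller's ctx.
mockFactory.
	On("GetAllRecommendations", mock.Anything, enabledAccount).
	Return(sampleRecommendations, nil)

// Calls made before/after the fan-out still run on the caller's context,
// so they keep pinning the literal ctx.
mockStore.
	On("GetGlobalConfig", ctx).
	Return(globalConfig, nil)
```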

New TestScheduler_CollectRecommendations_ParallelProviders pins three
contracts:

  1. successfulProviders ordering matches EnabledProviders config order,
     not goroutine completion order or alphabetical. Asserted with a
     deliberately non-alphabetical config (["gcp", "aws", "azure"])
     where AWS fails (ambient credentials → factory mock returns error)
     and Azure/GCP succeed empty (no enabled accounts) — the resulting
     successfulProviders must be ["gcp", "azure"], distinguishable from
     any of: input order, alphabetical, or arbitrary map iteration.

  2. ctx cancellation propagates: a pre-cancelled ctx surfaces as
     context.Canceled (not a "successful but empty" CollectResult).

Expected wall-time impact: CollectRecommendations drops from
sum(per-provider) to max(per-provider). With AWS+Azure+GCP all enabled
and Azure typically the slowest at ~15-20s post-#258, total drops from
~30-45s to ~max(15-20s).
@coderabbitai

coderabbitai Bot commented May 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c5f722e7-5d3d-46d8-bb97-e5fbfb995e98

📥 Commits

Reviewing files that changed from the base of the PR and between 472ba71 and 5a7a2ad.

⛔ Files ignored due to path filters (1)
  • providers/aws/go.sum is excluded by !**/*.sum
📒 Files selected for processing (8)
  • internal/scheduler/scheduler.go
  • internal/scheduler/scheduler_test.go
  • providers/aws/go.mod
  • providers/aws/recommendations/client.go
  • providers/aws/recommendations/client_test.go
  • providers/gcp/go.mod
  • providers/gcp/recommendations.go
  • providers/gcp/recommendations_test.go

📝 Walkthrough

This PR implements concurrent fan-out parallelization at three layers: AWS service collection (5 sequential calls→parallel), GCP region/service collection (doubly serial→nested parallel with configurable concurrency cap), and scheduler provider collection (3 sequential→parallel). Context cancellation is explicitly propagated after errgroup.Wait(), per-provider/service/region failures are isolated via error tracking maps, and deterministic merge order is preserved via config-order iteration or region sorting. Go module dependencies are updated to support errgroup.

Changes

AWS Service-Level Parallelization

  • Dependencies (providers/aws/go.mod): Added golang.org/x/sync v0.20.0 as a direct dependency and updated the Go version to 1.25.0.
  • Service Fan-Out Implementation (providers/aws/recommendations/client.go): Replaced the sequential per-service loop with an errgroup.WithContext fan-out across EC2, RDS, ElastiCache, OpenSearch, Redshift. Removed the 100ms artificial stagger. Added serviceResult and mergeServiceResults helpers for isolated error handling and deterministic order.
  • Context Propagation (providers/aws/recommendations/client.go): After errgroup.Wait(), ctx.Err() is explicitly checked and propagated so parent context cancellation surfaces as context.Canceled rather than silent success.
  • Test Coverage (providers/aws/recommendations/client_test.go): Added TestGetAllRecommendations_PropagatesContextCancellation asserting a pre-cancelled context returns context.Canceled and nil recommendations.

GCP Region/Service-Level Parallelization

  • Dependencies & Configuration (providers/gcp/go.mod, providers/gcp/recommendations.go): Promoted golang.org/x/sync v0.6.0 to a direct dependency. Added CUDLY_GCP_REGION_PARALLELISM environment variable support (default 10) via the gcpRegionConcurrency() helper and the defaultGCPRegionConcurrency constant.
  • Region Fan-Out (providers/gcp/recommendations.go): Replaced the sequential region loop with errgroup.WithContext capped at gcpRegionConcurrency(). Per-region results are stored in a mutex-guarded map; the deterministic merge sorts region names post-wait.
  • Per-Region Service Parallelization (providers/gcp/recommendations.go): A new collectRegion helper runs compute and Cloud SQL fetches concurrently within each region. Per-service errors log at WARN and are isolated; neither service's failure cancels its sibling.
  • Context Propagation (providers/gcp/recommendations.go): After the outer errgroup wait, ctx.Err() is checked and propagated to surface parent context cancellation.
  • Test Coverage (providers/gcp/recommendations_test.go): Added TestRecommendationsClientAdapter_GetRecommendations_PropagatesContextCancellation and TestGCPRegionConcurrency for context handling and environment-driven concurrency configuration.

Scheduler Provider-Level Parallelization

  • Provider Fan-Out Implementation (internal/scheduler/scheduler.go): Replaced the sequential per-provider loop with a collectAllProviders method. Uses errgroup.WithContext to fan out AWS, Azure and GCP collection in parallel. A new providerOutcome struct and a mutex-guarded outcome map track success/failure per provider.
  • Deterministic Aggregation (internal/scheduler/scheduler.go): After errgroup.Wait(), outcomes are merged in EnabledProviders config order (not completion order) to produce successfulProviders, successfulCollects, allRecommendations, totalSavings, and the failedProviders map.
  • Context Propagation (internal/scheduler/scheduler.go): Per-provider errors are isolated (goroutines return nil); failed outcomes populate failedProviders but don't cancel siblings. Post-wait ctx.Err() is checked and propagated.
  • Test Coverage (internal/scheduler/scheduler_test.go): Added TestScheduler_CollectRecommendations_ParallelProviders validating deterministic ordering, per-provider error isolation, and context cancellation propagation. Updated existing tests to relax context matching (mock.Anything instead of a strict ctx) to account for errgroup-derived contexts.

Sequence Diagram

sequenceDiagram
    participant Scheduler
    participant EG_Providers as errgroup<br/>(Providers)
    participant AWS as AWS Provider
    participant EG_Services as errgroup<br/>(Services)
    participant EC2 as EC2 Service
    participant RDS as RDS Service
    participant Azure as Azure Provider
    participant GCP as GCP Provider
    participant ResultMap as Outcome Map<br/>(mutex)

    Scheduler->>EG_Providers: collectAllProviders(ctx)
    
    par Provider Fan-Out
        EG_Providers->>AWS: Go: collectProviderRecommendations(ctx)
        EG_Providers->>Azure: Go: collectProviderRecommendations(ctx)
        EG_Providers->>GCP: Go: collectProviderRecommendations(ctx)
        
        rect rgba(0, 128, 255, 0.5)
            Note over AWS,EG_Services: AWS (simplified flow)
            AWS->>EG_Services: GetAllRecommendations(ctx)<br/>fan-out 5 services
            par Service Fan-Out
                EG_Services->>EC2: Go: GetRecommendationsForService
                EG_Services->>RDS: Go: GetRecommendationsForService
            end
            EG_Services->>AWS: Wait() + ctx.Err()
            AWS->>ResultMap: record outcome (success/error)
        end
        
        rect rgba(100, 200, 100, 0.5)
            Note over Azure: Azure (already concurrent)
            Azure->>ResultMap: record outcome
        end
        
        rect rgba(255, 165, 0, 0.5)
            Note over GCP: GCP (region+service fan-out)
            GCP->>ResultMap: record outcome
        end
    end
    
    EG_Providers->>EG_Providers: Wait()
    EG_Providers->>EG_Providers: ctx.Err() check
    EG_Providers->>Scheduler: Merge outcomes<br/>in config order
    Scheduler->>Scheduler: Return aggregated results<br/>(allRecommendations,<br/>totalSavings,<br/>successfulProviders,<br/>failedProviders)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • LeanerCloud/CUDly#258: This PR addresses the AWS (#266), GCP (#267), and scheduler (#268) parallelization work that follows the pattern established by the Azure parallelization in #258.

Possibly related PRs

  • LeanerCloud/CUDly#195: Modifies providers/aws/recommendations/client.go with similar context and result aggregation patterns.
  • LeanerCloud/CUDly#260: Modifies internal/scheduler/scheduler.go's CollectRecommendations method with related state management and cleanup concerns.

Suggested labels

effort/m, perf, concurrent

Poem

🐰 Five services dance in parallel streams,
Regions fan out like spring's wildest dreams,
Providers leap forth, no longer in queue,
From sequential slowness to speedups that flew!
Deterministic merging keeps order so true,
Cancellation flows cleanly when contexts are through. 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title 'perf: parallelize recommendations collection at every level' directly summarizes the main change across all three related issues, indicating the shift from sequential to concurrent collection at the AWS, GCP, and scheduler levels.
  • Linked Issues Check (✅ Passed): All code changes directly address the three linked issues: AWS service-loop parallelization under errgroup (#266), GCP region+service fan-out with configurable concurrency (#267), and scheduler provider fan-out (#268). Context cancellation propagation, per-level error isolation, deterministic merge ordering, and supporting tests are all implemented as specified.
  • Out of Scope Changes Check (✅ Passed): All changes are directly scoped to the three parallelization objectives. The go.mod updates (adding golang.org/x/sync, updating the Go version) are necessary dependencies for errgroup-based concurrency across all three modules.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage; check skipped.


@cristim cristim added the priority/p2 (Backlog-worthy), severity/medium (Moderate harm), urgency/eventually (No deadline), urgency/this-sprint (Within the current sprint), impact/internal (Team-internal only), impact/many (Affects most users), effort/s (Hours), effort/xs (Trivial / one-liner), type/bug (Defect), type/feat (New capability) and triaged (Item has been triaged) labels May 4, 2026
@cristim
Member Author

cristim commented May 4, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cristim cristim merged commit defac6c into feat/multicloud-web-frontend May 4, 2026
4 checks passed
@cristim cristim deleted the perf/parallelize-recommendations-collection branch May 4, 2026 20:30
@cristim
Member Author

cristim commented May 4, 2026

@coderabbitai review

Pushed two follow-up commits in response to the live-deploy OOM:

  • perf(concurrency) (aec56d3): single shared semaphore stashed on ctx, capping aggregate concurrent IO across the whole collect tree at CUDLY_MAX_PARALLELISM (default 20). Leaf goroutines Acquire/Release; intermediate dispatchers don't, so there's no deadlock from holding a permit while waiting on sub-permits. Acquire/Release are no-ops when no semaphore is on ctx, so CLI / unit-test paths are unaffected.
  • infra(aws/dev) (bfda7e4): bump dev Lambda memory 512 → 2048 MB so the user can verify the new stack completes end-to-end on a real deploy. Stage/prod untouched.

Test impact: 11 new pkg/concurrency tests (parser + bounds + cancel-respect); 2 scheduler mocks loosened (UpsertRecommendations/SendNewRecommendationsNotification) since they now receive the wrapped ctx.

@coderabbitai

coderabbitai Bot commented May 4, 2026

Tip

For best results, initiate chat on the files or code changes.

@cristim Sure! I'll review the two follow-up commits now — the shared semaphore approach for CUDLY_MAX_PARALLELISM and the dev Lambda memory bump.

[review]

cristim added a commit that referenced this pull request May 4, 2026
…llowup #269) (#270)

* perf(concurrency): cap aggregate collect parallelism via shared semaphore

The recommendations-collection fan-out has up to four nested levels:
provider → account → service|region → per-region service. Each level was
independently capped, so peak goroutine counts multiplied through the
tree (3 providers × 20 accounts × 30 GCP regions × 2 services = 3,600
in-flight gRPC/HTTP clients on a real multi-account deploy). On a 512 MB
Lambda this exhausted memory before the work could finish — observed in
dev as `signal: killed` at Max Memory Used = 512 MB after ~28s, with the
runtime never reaching the 60s timeout.

A single semaphore stashed on ctx now lets every leaf goroutine (the
goroutine that issues the actual cloud-API call) Acquire one slot before
doing IO and Release after, so aggregate concurrent IO is hard-bounded
by CUDLY_MAX_PARALLELISM (default 20) regardless of how nested the
dispatch is. Intermediate dispatchers (provider, account, GCP region)
do NOT acquire — they only launch sub-goroutines — so no goroutine can
deadlock by holding a permit while waiting for sub-permits.
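A minimal sketch of that ctx-stashed semaphore shape; the real pkg/concurrency
helpers described under Mechanics below may differ in detail.

```go
package concurrency

import (
	"context"

	"golang.org/x/sync/semaphore"
)

type ctxKey struct{}

// WithSharedSemaphore attaches sem to ctx; a nil sem returns ctx unchanged.
func WithSharedSemaphore(ctx context.Context, sem *semaphore.Weighted) context.Context {
	if sem == nil {
		return ctx
	}
	return context.WithValue(ctx, ctxKey{}, sem)
}

// SharedSemaphore returns the semaphore stashed on ctx, or nil when absent.
func SharedSemaphore(ctx context.Context) *semaphore.Weighted {
	sem, _ := ctx.Value(ctxKey{}).(*semaphore.Weighted)
	return sem
}

// Acquire takes one slot before a leaf goroutine does IO. It is a no-op when
// no semaphore is on ctx, so CLI and unit-test paths skip it entirely.
func Acquire(ctx context.Context) error {
	sem := SharedSemaphore(ctx)
	if sem == nil {
		return nil
	}
	return sem.Acquire(ctx, 1) // returns ctx.Err() when cancelled mid-wait
}

// Release gives the slot back; also a no-op when no semaphore is on ctx.
func Release(ctx context.Context) {
	if sem := SharedSemaphore(ctx); sem != nil {
		sem.Release(1)
	}
}
```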

Mechanics:

- New pkg/concurrency package: WithSharedSemaphore / SharedSemaphore /
  Acquire / Release helpers that read the semaphore from ctx, plus
  MaxParallelismFromEnv reading CUDLY_MAX_PARALLELISM (default 20).
  Acquire/Release are no-ops when no semaphore is on ctx — CLI tools
  and unit tests skip the semaphore entirely without per-call branching.

- Scheduler.CollectRecommendations allocates the semaphore (size from
  env) and attaches to ctx via concurrency.WithSharedSemaphore BEFORE
  the provider fan-out, so the wrapped ctx flows through every nested
  level. Logs the effective cap on entry.

- Three leaf fan-outs wired:
    - providers/aws/recommendations/client.go::GetAllRecommendations —
      5 service goroutines (EC2 / RDS / ElastiCache / OpenSearch /
      Redshift)
    - providers/azure/recommendations.go::GetRecommendations — 5
      service goroutines (compute / database / cache / cosmosdb /
      advisor); Acquire/Release boilerplate extracted into a goService
      helper so the function stays under gocyclo's 10-branch gate.
    - providers/gcp/recommendations.go::collectRegion — 2 sub-fan-out
      goroutines (compute / cloudsql) per region

  Each leaf calls Acquire(gctx) at the top, defers Release(gctx). On
  Acquire failure (parent ctx cancelled mid-wait) the leaf's per-service
  err variable is set to ctx.Err() and the goroutine returns nil to the
  errgroup — preserves the documented error-isolation contract.
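A possible shape of the goService-style leaf wrapper described above, assuming
the pkg/concurrency helpers sketched earlier (illustrative, not the repo's
exact code):

```go
// goService launches one leaf goroutine: it acquires a shared-semaphore slot
// around the actual IO and records its error without cancelling siblings.
func goService(g *errgroup.Group, gctx context.Context,
	fetch func(context.Context) error, errOut *error) {

	g.Go(func() error {
		if err := concurrency.Acquire(gctx); err != nil {
			*errOut = err // Acquire failed: parent ctx cancelled mid-wait
			return nil    // preserve error isolation, never cancel siblings
		}
		defer concurrency.Release(gctx)
		*errOut = fetch(gctx)
		return nil
	})
}
```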

Test mocks updated: persistCollection's UpsertRecommendations and the
notification email's SendNewRecommendationsNotification both run inside
CollectRecommendations after the ctx is wrapped, so their mock setups
were pinning the original (unwrapped) ctx. mock.Anything is the
appropriate looseness — the wrapped ctx is implementation detail.

New pkg/concurrency tests pin:
  - MaxParallelismFromEnv env-knob parser (unset / positive / zero /
    negative / non-numeric / explicit-unset)
  - Acquire/Release are no-ops when no semaphore on ctx
  - WithSharedSemaphore returns ctx unchanged when sem is nil
  - 20 goroutines under cap=3 never see >3 in-flight (load-bearing)
  - Acquire returns ctx.Err() when cancelled mid-wait

This is a smoke-test config — the user's dev deploy hit OOM with the
existing per-level caps; CUDLY_MAX_PARALLELISM=20 (paired with a Lambda
memory bump to 2 GB in a follow-up commit) should give the dev refresh
headroom to complete. Operators can dial up further via env override.

* infra(aws/dev): bump Lambda memory 512 MB → 2 GB for collect headroom

The dev Lambda hit Max Memory Used = 512 MB and was killed by the
runtime at ~28s (Duration 28701ms / Timeout 60s) on a user-triggered
recommendations refresh. The collection fan-out (post-#266 #267 #268)
keeps many concurrent gRPC/HTTP clients in flight — 30+ GCP regions ×
2 services each, plus 5 AWS services and 5 Azure services — and 512 MB
is too tight for that working-set even with the new shared-semaphore
cap.

This is a temporary headroom bump scoped to dev so the user can verify
the post-#266/#267/#268/concurrency-semaphore stack actually completes
end-to-end on a deployed Lambda. The shared semaphore (preceding
commit) is the durable fix for peak concurrency; this just gives the
function room to breathe while we observe the new behaviour live.

Only github-dev.tfvars is touched. github-staging.tfvars and
github-prod.tfvars stay at 1024 MB — they don't exercise every
provider with mostly-broken credentials the way dev does, and bumping
prod without first observing dev would be premature.

dev.tfvars (the user's local-apply config, gitignored) is also bumped
locally to keep parity with the CI deploy until this lands.

* fix(concurrency,aws): address PR #270 CodeRabbit review

Two CodeRabbit findings on PR #270:

1. **`pkg/concurrency/concurrency_test.go` (CRITICAL)**:
   `require.NoError` was called from inside the worker goroutines in
   `TestSharedSemaphore_BoundsConcurrency`. Testify's documented contract
   is that `require.*` / `FailNow`-style helpers must run on the test's
   own goroutine — calling them from a worker uses `runtime.Goexit` on
   the worker instead of safely stopping the test, which can hang or
   skip cleanup and produces race-detector noise.

   Workers now capture their Acquire result on a buffered `errCh` and
   the main goroutine drains the channel and asserts after `wg.Wait()`.
   `Release` is deferred only on a successful Acquire, matching the
   documented pairing contract.

2. **`providers/aws/recommendations/client.go` (MAJOR)**:
   The per-service `g.Go` goroutines in `GetAllRecommendations` were
   calling `concurrency.Acquire`/`Release` around the whole
   `GetRecommendationsForService` sweep — but that helper internally
   loops 2 × 3 (term, payment) variants and waits on the rate-limiter
   inside each retry, so a throttled service would hold a global slot
   for seconds while no Cost Explorer request was actually in flight.
   Effective parallelism dropped well below `CUDLY_MAX_PARALLELISM`.

   The Acquire/Release boundary moves down to each individual
   `costExplorerClient.GetReservationPurchaseRecommendation` call (and
   the SP equivalent in `parser_sp.go`). The rate-limiter's
   `Wait` happens *outside* the permit, so a goroutine waiting on
   exponential backoff doesn't tie up a slot. Each retry
   acquires a fresh permit, so throughput now scales with the cap
   regardless of how many services are individually throttled.
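An illustrative sketch of the narrowed boundary from finding 2; the Client
fields, the limiter interface and the helper name are assumptions, and only
the Wait-outside / Acquire-around-the-call ordering is the point.

```go
import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/costexplorer"
)

type Client struct {
	costExplorerClient *costexplorer.Client
	rateLimiter        interface{ Wait(context.Context) error } // stand-in for the real limiter
}

func (c *Client) getReservationRecommendation(
	ctx context.Context,
	input *costexplorer.GetReservationPurchaseRecommendationInput,
) (*costexplorer.GetReservationPurchaseRecommendationOutput, error) {
	// Rate-limiter / backoff waits happen outside the permit, so a throttled
	// retry never occupies a global concurrency slot while it sleeps.
	if err := c.rateLimiter.Wait(ctx); err != nil {
		return nil, err
	}

	// Each attempt takes a fresh permit only for the duration of the actual
	// Cost Explorer request.
	if err := concurrency.Acquire(ctx); err != nil {
		return nil, err
	}
	defer concurrency.Release(ctx)

	return c.costExplorerClient.GetReservationPurchaseRecommendation(ctx, input)
}
```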
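And a sketch of the errCh pattern from finding 1, shown against a plain
x/sync semaphore rather than the real pkg/concurrency helpers and bounds
assertions:

```go
package concurrency_test

import (
	"context"
	"sync"
	"sync/atomic"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
	"golang.org/x/sync/semaphore"
)

func TestSemaphore_BoundsConcurrency(t *testing.T) {
	sem := semaphore.NewWeighted(3)
	ctx := context.Background()

	const workers = 20
	errCh := make(chan error, workers) // buffered: workers never block on send
	var inFlight, maxSeen atomic.Int64
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			err := sem.Acquire(ctx, 1)
			errCh <- err // never call require.* from a worker goroutine
			if err != nil {
				return
			}
			defer sem.Release(1) // Release pairs only a successful Acquire

			// Track the peak number of workers holding a permit.
			cur := inFlight.Add(1)
			for {
				m := maxSeen.Load()
				if cur <= m || maxSeen.CompareAndSwap(m, cur) {
					break
				}
			}
			time.Sleep(10 * time.Millisecond) // hold the slot briefly
			inFlight.Add(-1)
		}()
	}
	wg.Wait()
	close(errCh)

	// Assertions run on the test's own goroutine, after all workers are done.
	for err := range errCh {
		require.NoError(t, err)
	}
	require.LessOrEqual(t, maxSeen.Load(), int64(3))
}
```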

Tests:
- `go test ./recommendations/...` from `providers/aws/` — pass
- `go test -race ./concurrency/...` from `pkg/` — race-clean
- Existing `go test -race ./recommendations/...` reveals a pre-existing
  data race on `Client.rateLimiter` (concurrent `Reset` / state
  mutation across the per-service goroutines), introduced by the
  parallelisation in #269 (`defac6c2a`). Filed as #271 — out of scope
  for this CR-fix commit, fix is straightforward (per-goroutine rate-
  limiter or internal mutex).

Other CodeRabbit checks (Azure / GCP per-service goroutines holding
permits across SDK calls) are NOT applied here because:
- Azure's per-service SDK calls don't have an explicit rate-limiter
  Wait loop equivalent to AWS's — the inner ARM/pricing calls block
  on network IO, not exponential backoff.
- GCP's `collectRegion` sub-fan-out goroutines do similarly direct SDK
  calls without an explicit Wait loop.
- CR didn't flag those layers; if a future review surfaces the same
  argument with concrete throughput evidence, we can mirror the AWS
  pattern there.