
perf: parallelize recommendations collection at every level (closes #266 #267 #268)#269

Merged
cristim merged 3 commits into feat/multicloud-web-frontend from perf/parallelize-recommendations-collection on May 4, 2026

Conversation

@cristim
Member

@cristim cristim commented May 4, 2026

Summary

Mirrors the Azure parallelisation (#258 / b10326c5) onto every other
level of the recommendations-collection tree. After this PR, every
hop of the collect path runs concurrently — total wall time drops from
sum(per-provider × per-account × per-service-or-region) to max(...)
at every level.

Three sibling perf commits, one per closed issue:

  • perf(aws) (closes #266): GetAllRecommendations's 5 sequential
    service calls (EC2 / RDS / ElastiCache / OpenSearch / Redshift) now
    run concurrently under errgroup. Drops the 100ms artificial stagger
    (a sequential-mode rate-limit hack); the post-Wait ctx.Err() block
    matches the Azure shape. New
    TestGetAllRecommendations_PropagatesContextCancellation pins the
    contract.
  • perf(gcp) (closes #267): GetRecommendations's nested
    for region { for service { ... } } (~30+ regions × 2 services =
    60+ sequential RPCs) is replaced by a two-level fan-out: outer
    errgroup over regions (capped at CUDLY_GCP_REGION_PARALLELISM,
    default 10) + inner sub-errgroup over (compute, sql) per region.
    New TestRecommendationsClientAdapter_GetRecommendations_PropagatesContextCancellation
    and TestGCPRegionConcurrency pin the contract and the env-knob parser.
  • perf(scheduler) (closes #268): CollectRecommendations's
    provider loop now fans out AWS/Azure/GCP under errgroup. A
    deterministic merge walks EnabledProviders in config order so
    successfulProviders ordering is independent of goroutine completion
    order. New TestScheduler_CollectRecommendations_ParallelProviders
    pins the config-order + ctx-cancellation contracts.

Behaviour change

Per-service / per-(region, service) errors that previously were silently
swallowed by continue (AWS) / if err == nil { ... } (GCP) are now
logged at WARN. Errors still don't propagate to callers — only the log
severity changes from "invisible" to WARN, matching the observability
the Azure parallelisation added.

Wall-time impact (cumulative with #258 + #260)

  • AWS per account: sum(5 per-service) + 400ms stagger → max(5)
  • GCP per account: N_regions × (compute + sql) →
    ceil(N_regions / cap) × max(compute, sql) (~6× speedup at 30
    regions / cap 10, before counting per-region sub-fan-out)
  • Scheduler total: sum(per-provider) → max(per-provider)
    (~30-45s → ~max(15-20s) with all three providers enabled)

User-facing 502 was already gone via #260's async self-invoke; this
PR shrinks the in-flight freshness-banner window and gives cron runs
more headroom before the next tick.

Test plan

  • go build ./... from repo root — clean
  • go test ./internal/scheduler/... — 74 pass (was 71; +3 new)
  • go test ./recommendations/... from providers/aws/ — 176 pass
  • go test . from providers/gcp/ (top-level package) — 18 pass
    (recommendations + concurrency + shouldIncludeService)
  • Pre-commit hooks pass (gocyclo, secrets scan, gosec, trivy, Go
    tests, golangci-lint)
  • Post-merge: monitor scheduler Lambda CloudWatch logs for the
    freshness-banner in-flight window — expected to shrink to
    max(per-provider).

🤖 Generated with claude-flow

Summary by CodeRabbit

  • New Features

    • Recommendations are now collected concurrently across providers and services, significantly improving collection performance.
  • Bug Fixes

    • Provider and service failures are now isolated; one failure no longer blocks other providers from completing their collections.
    • Context cancellation is properly propagated after concurrent collection completes.
  • Chores

    • Updated dependencies to support improved concurrency handling.

cristim added 3 commits May 4, 2026 21:30
…tions (closes #266)

The dispatcher in providers/aws/recommendations/client.go::GetAllRecommendations
ran 5 service calls (EC2 / RDS / ElastiCache / OpenSearch / Redshift) one
after another with an artificial 100ms stagger between each. Each per-
service call hits AWS Cost Explorer's GetReservationPurchaseRecommendation,
which routinely takes multiple seconds. Total wall time per account scaled
as sum(per-service) instead of max(per-service).

Mirrors the Azure parallelisation in providers/azure/recommendations.go
(closes #258, commit b10326c). The five calls have no inter-service
dependency, so they can run concurrently under errgroup.WithContext. Each
goroutine captures its own error in a closure-scoped variable and returns
nil to the group so one service's failure does not cancel siblings
(matching the previous loop's `continue`-on-error tolerance).

Results are merged in the canonical order
EC2 → RDS → ElastiCache → OpenSearch → Redshift after Wait() so any
order-sensitive callers stay stable. After Wait(), ctx.Err() is checked
and propagated so a canceled parent ctx surfaces as context.Canceled /
context.DeadlineExceeded rather than being swallowed by the per-service
error-isolation goroutines (which all return nil).
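A minimal sketch of that shape follows; serviceResult, the fetcher signature
and the Recommendation type here are illustrative stand-ins, not the exact
symbols in client.go.

```go
package recommendations

import (
	"context"
	"log/slog"

	"golang.org/x/sync/errgroup"
)

type Recommendation struct{ Service string }

type serviceResult struct {
	service string
	recs    []Recommendation
	err     error
}

func getAllRecommendations(ctx context.Context,
	fetch func(context.Context, string) ([]Recommendation, error)) ([]Recommendation, error) {

	services := []string{"EC2", "RDS", "ElastiCache", "OpenSearch", "Redshift"}
	results := make([]serviceResult, len(services))

	g, gctx := errgroup.WithContext(ctx)
	for i, svc := range services {
		g.Go(func() error {
			recs, err := fetch(gctx, svc)
			// Record the outcome instead of returning err: one failing
			// service must not cancel its siblings via the errgroup.
			results[i] = serviceResult{service: svc, recs: recs, err: err}
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	// Every goroutine returned nil, so a cancelled parent ctx has to be
	// surfaced explicitly after Wait().
	if err := ctx.Err(); err != nil {
		return nil, err
	}

	// Deterministic merge in the canonical service order; per-service
	// failures only downgrade to a WARN log.
	var merged []Recommendation
	for _, r := range results {
		if r.err != nil {
			slog.Warn("service recommendations failed", "service", r.service, "error", r.err)
			continue
		}
		merged = append(merged, r.recs...)
	}
	return merged, nil
}
```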

The 100ms artificial stagger is removed — it was a sequential-mode rate-
limit hack that added 400ms per account for no benefit under concurrent
fan-out (Cost Explorer rate limits apply per request, not per consecutive
request).

Behaviour change vs the previous sequential loop: per-service errors are
now logged at WARN via the new mergeServiceResults helper. The previous
loop swallowed them silently with a bare `continue`, leaving operators no
signal when a single service was misbehaving. The serviceResult struct +
mergeServiceResults helper extraction also keeps GetAllRecommendations
under the project's gocyclo gate (.golangci.yml min-complexity: 15).

New test TestGetAllRecommendations_PropagatesContextCancellation pins the
contract: calling GetAllRecommendations with a pre-cancelled context must
return context.Canceled. Mirrors the Azure equivalent at
providers/azure/recommendations_test.go::TestRecommendationsClientAdapter_GetRecommendations_PropagatesContextCancellation.
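Written against the illustrative getAllRecommendations sketch above (the real
test targets the production client), the pinned contract looks roughly like
this:

```go
package recommendations

import (
	"context"
	"testing"

	"github.com/stretchr/testify/require"
)

func TestGetAllRecommendations_PropagatesContextCancellation(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // pre-cancelled before the call

	fetch := func(ctx context.Context, service string) ([]Recommendation, error) {
		return nil, ctx.Err() // per-service calls see the already-cancelled ctx
	}

	recs, err := getAllRecommendations(ctx, fetch)
	require.ErrorIs(t, err, context.Canceled)
	require.Nil(t, recs)
}
```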

Expected end-to-end after this change: ~max(individual call durations)
instead of sum + 4×100ms stagger.
…tions (closes #267)

The dispatcher in providers/gcp/recommendations.go::GetRecommendations was
doubly serial — `for region { for service { ... } }` — fetching Compute
Engine then Cloud SQL recommendations one region at a time. With ~30+
GCP regions × 2 services, that's 60+ sequential RPCs per account against
the project-scoped Recommender API. Even at 200ms per call this is a
12s+ floor; in practice most calls are slower. After #260 (async self-
invoke) this no longer triggers user-facing 502s, but it directly
inflates the scheduler Lambda's runtime and the in-flight freshness-
banner window.

Two-level concurrent fan-out, mirroring the Azure pattern in
providers/azure/recommendations.go (closes #258, commit b10326c) and
the AWS service-loop parallelisation (closes #266):

  - Outer: errgroup over regions, capped at gcpRegionConcurrency()
    (CUDLY_GCP_REGION_PARALLELISM, default 10) to stay polite to the
    project-scoped Recommender API quota. Lower than the existing
    account-level CUDLY_MAX_ACCOUNT_PARALLELISM (default 20) because
    each account already gets its own goroutine slot from
    fanOutPerAccount; cumulative concurrency would multiply otherwise.
  - Inner: within each region's goroutine, Compute + Cloud SQL run as
    two further goroutines under a per-region sub-errgroup, so per-
    region cost is max(compute, sql) rather than compute + sql.

Each goroutine returns nil to its errgroup so a single per-(region,
service) failure does not cancel siblings — preserves the previous
silent-skip-on-err shape semantically.

Behaviour change vs the previous nested for-loops: per-(region,
service) errors that were silently swallowed by the previous
`if err == nil { ... }` shape are now logged at WARN with the
region+service tag so misconfigured projects are diagnosable. Errors
still don't propagate to callers (preserving silent-skip), only the
log severity changes from "invisible" to WARN.

After the outer Wait(), ctx.Err() is checked and propagated so a
canceled parent ctx surfaces as context.Canceled /
context.DeadlineExceeded rather than being swallowed by the per-region
error-isolation goroutines (which all return nil).

Result merging walks regions in sorted order and appends compute then
sql per region — output is deterministic independent of GCP API
region-list ordering or goroutine completion order.
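A hedged sketch of the two-level fan-out and sorted merge; the fetcher
signatures, the Recommendation type and gcpRegionConcurrency here are
simplified stand-ins for the real code in providers/gcp/recommendations.go.

```go
package gcp

import (
	"context"
	"log/slog"
	"sort"
	"sync"

	"golang.org/x/sync/errgroup"
)

type Recommendation struct{ Region, Service string }

func getRecommendations(ctx context.Context, regions []string,
	fetchCompute, fetchSQL func(context.Context, string) ([]Recommendation, error)) ([]Recommendation, error) {

	var (
		mu       sync.Mutex
		byRegion = make(map[string][]Recommendation)
	)

	// Outer level: one goroutine per region, capped by the env knob.
	g, gctx := errgroup.WithContext(ctx)
	g.SetLimit(gcpRegionConcurrency())
	for _, region := range regions {
		g.Go(func() error {
			recs := collectRegion(gctx, region, fetchCompute, fetchSQL)
			mu.Lock()
			byRegion[region] = recs
			mu.Unlock()
			return nil // per-region failures never cancel sibling regions
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	if err := ctx.Err(); err != nil { // surface parent cancellation explicitly
		return nil, err
	}

	// Deterministic merge: sorted region order, compute before sql per region.
	keys := make([]string, 0, len(byRegion))
	for r := range byRegion {
		keys = append(keys, r)
	}
	sort.Strings(keys)
	var all []Recommendation
	for _, r := range keys {
		all = append(all, byRegion[r]...)
	}
	return all, nil
}

// Inner level: compute and Cloud SQL fetched concurrently within a region.
func collectRegion(ctx context.Context, region string,
	fetchCompute, fetchSQL func(context.Context, string) ([]Recommendation, error)) []Recommendation {

	var (
		compute, sql       []Recommendation
		computeErr, sqlErr error
	)
	sub, subCtx := errgroup.WithContext(ctx)
	sub.Go(func() error { compute, computeErr = fetchCompute(subCtx, region); return nil })
	sub.Go(func() error { sql, sqlErr = fetchSQL(subCtx, region); return nil })
	_ = sub.Wait() // both goroutines return nil; errors are kept per service

	for svc, err := range map[string]error{"compute": computeErr, "cloudsql": sqlErr} {
		if err != nil {
			slog.Warn("GCP recommendations failed", "region", region, "service", svc, "error", err)
		}
	}
	return append(compute, sql...)
}
```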

The collectRegion helper extraction also keeps GetRecommendations
under the project's gocyclo gate (.golangci.yml min-complexity: 15)
after the post-Wait ctx.Err() block was added.

Tests added:
  - TestRecommendationsClientAdapter_GetRecommendations_PropagatesContextCancellation
    pins the contract that a pre-cancelled ctx surfaces as context.Canceled.
  - TestGCPRegionConcurrency pins the env-knob parser semantics
    (unset/positive/non-numeric/zero/negative/explicit-unset).
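For reference, a plausible shape of the gcpRegionConcurrency parser those
semantics imply (the real implementation may differ in detail):

```go
package gcp

import (
	"os"
	"strconv"
)

const defaultGCPRegionConcurrency = 10

// gcpRegionConcurrency reads CUDLY_GCP_REGION_PARALLELISM and falls back to
// the default when the variable is unset, non-numeric, zero or negative.
func gcpRegionConcurrency() int {
	raw := os.Getenv("CUDLY_GCP_REGION_PARALLELISM")
	if raw == "" {
		return defaultGCPRegionConcurrency
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return defaultGCPRegionConcurrency
	}
	return n
}
```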

Expected end-to-end after this change:
ceil(N_regions / cap) × max(compute, sql) instead of
N_regions × (compute + sql) — for a project with 30 regions and a cap
of 10, that's a ~6× speedup before counting the per-region service-
level parallelism win.
…loses #268)

The provider loop in scheduler.go::CollectRecommendations ran AWS, Azure
and GCP collection sequentially: total wall time = sum(per-provider)
rather than max(per-provider). After #260 (async self-invoke) the user-
facing 502 on POST /api/recommendations/refresh is gone, but the
scheduler Lambda itself still pays the serial cost — visible as a
longer "collection in flight" freshness-banner window and reduced
cron-tick headroom.

Provider-level fan-out under errgroup, mirroring the Azure and AWS
service-loop parallelisations (#258 / #266). Each per-provider goroutine
returns nil to the group so a single provider's error does not cancel
its siblings — preserves the previous loop's `continue`-on-error
semantics. Per-provider results are written into a map under a single
mutex (the same pattern as fanOutPerAccount at scheduler.go:393).

After Wait, ctx.Err() is checked and propagated so a canceled parent
ctx surfaces as context.Canceled / context.DeadlineExceeded rather than
being swallowed by the per-provider error-isolation goroutines (which
all return nil).

Deterministic merge: walk EnabledProviders in config order to assemble
successfulProviders / successfulCollects / allRecommendations /
totalSavings / failedProviders. Order is by config, NOT by goroutine
completion order or by Go's randomised map iteration — keeps existing
tests stable and the freshness-banner / notification-email content
predictable.
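A minimal sketch of that fan-out plus config-order merge; the types and the
collect callback are simplified stand-ins for the real collectAllProviders in
scheduler.go.

```go
package scheduler

import (
	"context"
	"sync"

	"golang.org/x/sync/errgroup"
)

type Recommendation struct{ Provider string }

type providerOutcome struct {
	recs []Recommendation
	err  error
}

type CollectResult struct {
	SuccessfulProviders []string
	FailedProviders     map[string]error
	Recommendations     []Recommendation
}

func collectAllProviders(ctx context.Context, enabledProviders []string,
	collect func(context.Context, string) ([]Recommendation, error)) (*CollectResult, error) {

	var (
		mu       sync.Mutex
		outcomes = make(map[string]providerOutcome)
	)

	g, gctx := errgroup.WithContext(ctx)
	for _, provider := range enabledProviders {
		g.Go(func() error {
			recs, err := collect(gctx, provider)
			mu.Lock()
			outcomes[provider] = providerOutcome{recs: recs, err: err}
			mu.Unlock()
			return nil // a provider failure never cancels its siblings
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	if err := ctx.Err(); err != nil { // surface parent cancellation explicitly
		return nil, err
	}

	// Merge in EnabledProviders config order, not goroutine completion order
	// or map iteration order, so downstream ordering stays deterministic.
	res := &CollectResult{FailedProviders: map[string]error{}}
	for _, provider := range enabledProviders {
		o := outcomes[provider]
		if o.err != nil {
			res.FailedProviders[provider] = o.err
			continue
		}
		res.SuccessfulProviders = append(res.SuccessfulProviders, provider)
		res.Recommendations = append(res.Recommendations, o.recs...)
	}
	return res, nil
}
```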

The new collectAllProviders helper extraction also keeps
CollectRecommendations under the project's gocyclo gate
(.golangci.yml min-complexity: 15) after the errgroup + post-Wait
ctx.Err() block was added.

No concurrency cap on the outer fan-out — the universe is at most 3
providers; the per-account fan-out inside each provider is still
bounded by CUDLY_MAX_ACCOUNT_PARALLELISM (default 20) and per-region
GCP fan-out by CUDLY_GCP_REGION_PARALLELISM (default 10, from #267).

Test mocks updated: per-provider mock methods (CreateAndValidateProvider,
ListCloudAccounts, GetRecommendationsClient, GetAllRecommendations) are
now invoked from inside the errgroup goroutine, so they receive the
errgroup-derived gctx rather than the caller's ctx. Mock setups changed
from literal ctx to mock.Anything where the call sits inside the
goroutine path; calls made before/after the fan-out (GetGlobalConfig,
SendNewRecommendationsNotification, persistCollection writes) still
pin literal ctx because they run on the caller's context unchanged.
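Illustrative testify setups for the two cases (the mock object names and
argument values are hypothetical, not the repo's actual mocks):

```go
// Calls that now run inside an errgroup goroutine receive the
// errgroup-derived gctx, so the expectation can't pin the caller's ctx.
mockFactory.
	On("GetAllRecommendations", mock.Anything, enabledAccount).
	Return(sampleRecommendations, nil)

// Calls made before/after the fan-out still run on the caller's context,
// so they keep pinning the literal ctx.
mockStore.
	On("GetGlobalConfig", ctx).
	Return(globalConfig, nil)
```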

New TestScheduler_CollectRecommendations_ParallelProviders pins three
contracts:

  1. successfulProviders ordering matches EnabledProviders config order,
     not goroutine completion order or alphabetical. Asserted with a
     deliberately non-alphabetical config (["gcp", "aws", "azure"])
     where AWS fails (ambient credentials → factory mock returns error)
     and Azure/GCP succeed empty (no enabled accounts) — the resulting
     successfulProviders must be ["gcp", "azure"], distinguishable from
     any of: input order, alphabetical, or arbitrary map iteration.

  2. ctx cancellation propagates: a pre-cancelled ctx surfaces as
     context.Canceled (not a "successful but empty" CollectResult).

Expected wall-time impact: CollectRecommendations drops from
sum(per-provider) to max(per-provider). With AWS+Azure+GCP all enabled
and Azure typically the slowest at ~15-20s post-#258, total drops from
~30-45s to ~max(15-20s).
@coderabbitai

coderabbitai Bot commented May 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c5f722e7-5d3d-46d8-bb97-e5fbfb995e98

📥 Commits

Reviewing files that changed from the base of the PR and between 472ba71 and 5a7a2ad.

⛔ Files ignored due to path filters (1)
  • providers/aws/go.sum is excluded by !**/*.sum
📒 Files selected for processing (8)
  • internal/scheduler/scheduler.go
  • internal/scheduler/scheduler_test.go
  • providers/aws/go.mod
  • providers/aws/recommendations/client.go
  • providers/aws/recommendations/client_test.go
  • providers/gcp/go.mod
  • providers/gcp/recommendations.go
  • providers/gcp/recommendations_test.go

📝 Walkthrough

This PR implements concurrent fan-out parallelization at three layers: AWS service collection (5 sequential calls→parallel), GCP region/service collection (doubly serial→nested parallel with configurable concurrency cap), and scheduler provider collection (3 sequential→parallel). Context cancellation is explicitly propagated after errgroup.Wait(), per-provider/service/region failures are isolated via error tracking maps, and deterministic merge order is preserved via config-order iteration or region sorting. Go module dependencies are updated to support errgroup.

Changes

AWS Service-Level Parallelization

  • Dependencies (providers/aws/go.mod): Added golang.org/x/sync v0.20.0 as a direct dependency and updated the Go version to 1.25.0.
  • Service Fan-Out Implementation (providers/aws/recommendations/client.go): Replaced the sequential per-service loop with an errgroup.WithContext fan-out across EC2, RDS, ElastiCache, OpenSearch, Redshift. Removed the 100ms artificial stagger. Added serviceResult and mergeServiceResults helpers for isolated error handling and deterministic order.
  • Context Propagation (providers/aws/recommendations/client.go): After errgroup.Wait(), ctx.Err() is explicitly checked and propagated so parent context cancellation surfaces as context.Canceled rather than silent success.
  • Test Coverage (providers/aws/recommendations/client_test.go): Added TestGetAllRecommendations_PropagatesContextCancellation asserting a pre-cancelled context returns context.Canceled and nil recommendations.

GCP Region/Service-Level Parallelization

  • Dependencies & Configuration (providers/gcp/go.mod, providers/gcp/recommendations.go): Promoted golang.org/x/sync v0.6.0 to a direct dependency. Added CUDLY_GCP_REGION_PARALLELISM environment variable support (default 10) via the gcpRegionConcurrency() helper and the defaultGCPRegionConcurrency constant.
  • Region Fan-Out (providers/gcp/recommendations.go): Replaced the sequential region loop with errgroup.WithContext capped at gcpRegionConcurrency(). Per-region results are stored in a mutex-guarded map; the deterministic merge sorts region names post-wait.
  • Per-Region Service Parallelization (providers/gcp/recommendations.go): A new collectRegion helper runs compute and Cloud SQL fetches concurrently within each region. Per-service errors log at WARN and are isolated; neither service's failure cancels its sibling.
  • Context Propagation (providers/gcp/recommendations.go): After the outer errgroup wait, ctx.Err() is checked and propagated to surface parent context cancellation.
  • Test Coverage (providers/gcp/recommendations_test.go): Added TestRecommendationsClientAdapter_GetRecommendations_PropagatesContextCancellation and TestGCPRegionConcurrency for context handling and environment-driven concurrency configuration.

Scheduler Provider-Level Parallelization

  • Provider Fan-Out Implementation (internal/scheduler/scheduler.go): Replaced the sequential per-provider loop with a collectAllProviders method. Uses errgroup.WithContext to fan out AWS, Azure and GCP collection in parallel. A new providerOutcome struct and a mutex-guarded outcome map track success/failure per provider.
  • Deterministic Aggregation (internal/scheduler/scheduler.go): After errgroup.Wait(), outcomes are merged in EnabledProviders config order (not completion order) to produce successfulProviders, successfulCollects, allRecommendations, totalSavings, and the failedProviders map.
  • Context Propagation (internal/scheduler/scheduler.go): Per-provider errors are isolated (goroutines return nil); failed outcomes populate failedProviders but don't cancel siblings. Post-wait ctx.Err() is checked and propagated.
  • Test Coverage (internal/scheduler/scheduler_test.go): Added TestScheduler_CollectRecommendations_ParallelProviders validating deterministic ordering, per-provider error isolation, and context cancellation propagation. Updated existing tests to relax context matching (mock.Anything instead of a strict ctx) to account for errgroup-derived contexts.

Sequence Diagram

sequenceDiagram
    participant Scheduler
    participant EG_Providers as errgroup<br/>(Providers)
    participant AWS as AWS Provider
    participant EG_Services as errgroup<br/>(Services)
    participant EC2 as EC2 Service
    participant RDS as RDS Service
    participant Azure as Azure Provider
    participant GCP as GCP Provider
    participant ResultMap as Outcome Map<br/>(mutex)

    Scheduler->>EG_Providers: collectAllProviders(ctx)
    
    par Provider Fan-Out
        EG_Providers->>AWS: Go: collectProviderRecommendations(ctx)
        EG_Providers->>Azure: Go: collectProviderRecommendations(ctx)
        EG_Providers->>GCP: Go: collectProviderRecommendations(ctx)
        
        rect rgba(0, 128, 255, 0.5)
            Note over AWS,EG_Services: AWS (simplified flow)
            AWS->>EG_Services: GetAllRecommendations(ctx)<br/>fan-out 5 services
            par Service Fan-Out
                EG_Services->>EC2: Go: GetRecommendationsForService
                EG_Services->>RDS: Go: GetRecommendationsForService
            end
            EG_Services->>AWS: Wait() + ctx.Err()
            AWS->>ResultMap: record outcome (success/error)
        end
        
        rect rgba(100, 200, 100, 0.5)
            Note over Azure: Azure (already concurrent)
            Azure->>ResultMap: record outcome
        end
        
        rect rgba(255, 165, 0, 0.5)
            Note over GCP: GCP (region+service fan-out)
            GCP->>ResultMap: record outcome
        end
    end
    
    EG_Providers->>EG_Providers: Wait()
    EG_Providers->>EG_Providers: ctx.Err() check
    EG_Providers->>Scheduler: Merge outcomes<br/>in config order
    Scheduler->>Scheduler: Return aggregated results<br/>(allRecommendations,<br/>totalSavings,<br/>successfulProviders,<br/>failedProviders)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • LeanerCloud/CUDly#258: This PR addresses the AWS (#266), GCP (#267), and scheduler (#268) parallelization work that follows the pattern established by the Azure parallelization in #258.

Possibly related PRs

  • LeanerCloud/CUDly#195: Modifies providers/aws/recommendations/client.go with similar context and result aggregation patterns.
  • LeanerCloud/CUDly#260: Modifies internal/scheduler/scheduler.go's CollectRecommendations method with related state management and cleanup concerns.

Suggested labels

effort/m, perf, concurrent

Poem

🐰 Five services dance in parallel streams,
Regions fan out like spring's wildest dreams,
Providers leap forth, no longer in queue,
From sequential slowness to speedups that flew!
Deterministic merging keeps order so true,
Cancellation flows cleanly when contexts are through. 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title 'perf: parallelize recommendations collection at every level' directly summarizes the main change across all three related issues, indicating the shift from sequential to concurrent collection at the AWS, GCP, and scheduler levels.
  • Linked Issues Check (✅ Passed): All code changes directly address the three linked issues: AWS service-loop parallelization under errgroup (#266), GCP region+service fan-out with configurable concurrency (#267), and scheduler provider fan-out (#268). Context cancellation propagation, per-level error isolation, deterministic merge ordering, and supporting tests are all implemented as specified.
  • Out of Scope Changes Check (✅ Passed): All changes are directly scoped to the three parallelization objectives. The go.mod updates (adding golang.org/x/sync, updating the Go version) are necessary dependencies for errgroup-based concurrency across all three modules.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage; check skipped.


@cristim cristim added the priority/p2 (Backlog-worthy), severity/medium (Moderate harm), urgency/eventually (No deadline), urgency/this-sprint (Within the current sprint), impact/internal (Team-internal only), impact/many (Affects most users), effort/s (Hours), effort/xs (Trivial / one-liner), type/bug (Defect), type/feat (New capability) and triaged (Item has been triaged) labels May 4, 2026
@cristim
Member Author

cristim commented May 4, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cristim cristim merged commit defac6c into feat/multicloud-web-frontend May 4, 2026
4 checks passed
@cristim cristim deleted the perf/parallelize-recommendations-collection branch May 4, 2026 20:30
@cristim
Member Author

cristim commented May 4, 2026

@coderabbitai review

Pushed two follow-up commits in response to the live-deploy OOM:

  • perf(concurrency) (aec56d3): single shared semaphore stashed on ctx, capping aggregate concurrent IO across the whole collect tree at CUDLY_MAX_PARALLELISM (default 20). Leaf goroutines Acquire/Release; intermediate dispatchers don't, so there's no deadlock from holding a permit while waiting on sub-permits. Acquire/Release are no-ops when no semaphore is on ctx, so CLI / unit-test paths are unaffected.
  • infra(aws/dev) (bfda7e4): bump dev Lambda memory 512 → 2048 MB so the user can verify the new stack completes end-to-end on a real deploy. Stage/prod untouched.

Test impact: 11 new pkg/concurrency tests (parser + bounds + cancel-respect); 2 scheduler mocks loosened (UpsertRecommendations/SendNewRecommendationsNotification) since they now receive the wrapped ctx.

@coderabbitai

coderabbitai Bot commented May 4, 2026

Tip

For best results, initiate chat on the files or code changes.

@cristim Sure! I'll review the two follow-up commits now — the shared semaphore approach for CUDLY_MAX_PARALLELISM and the dev Lambda memory bump.

[review]

cristim added a commit that referenced this pull request May 4, 2026
…llowup #269) (#270)

* perf(concurrency): cap aggregate collect parallelism via shared semaphore

The recommendations-collection fan-out has up to four nested levels:
provider → account → service|region → per-region service. Each level was
independently capped, so peak goroutine counts multiplied through the
tree (3 providers × 20 accounts × 30 GCP regions × 2 services = 3,600
in-flight gRPC/HTTP clients on a real multi-account deploy). On a 512 MB
Lambda this exhausted memory before the work could finish — observed in
dev as `signal: killed` at Max Memory Used = 512 MB after ~28s, with the
runtime never reaching the 60s timeout.

A single semaphore stashed on ctx now lets every leaf goroutine (the
goroutine that issues the actual cloud-API call) Acquire one slot before
doing IO and Release after, so aggregate concurrent IO is hard-bounded
by CUDLY_MAX_PARALLELISM (default 20) regardless of how nested the
dispatch is. Intermediate dispatchers (provider, account, GCP region)
do NOT acquire — they only launch sub-goroutines — so no goroutine can
deadlock by holding a permit while waiting for sub-permits.
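A minimal sketch of that ctx-stashed semaphore shape; the real pkg/concurrency
helpers described under Mechanics below may differ in detail.

```go
package concurrency

import (
	"context"

	"golang.org/x/sync/semaphore"
)

type ctxKey struct{}

// WithSharedSemaphore attaches sem to ctx; a nil sem returns ctx unchanged.
func WithSharedSemaphore(ctx context.Context, sem *semaphore.Weighted) context.Context {
	if sem == nil {
		return ctx
	}
	return context.WithValue(ctx, ctxKey{}, sem)
}

// SharedSemaphore returns the semaphore stashed on ctx, or nil when absent.
func SharedSemaphore(ctx context.Context) *semaphore.Weighted {
	sem, _ := ctx.Value(ctxKey{}).(*semaphore.Weighted)
	return sem
}

// Acquire takes one slot before a leaf goroutine does IO. It is a no-op when
// no semaphore is on ctx, so CLI and unit-test paths skip it entirely.
func Acquire(ctx context.Context) error {
	sem := SharedSemaphore(ctx)
	if sem == nil {
		return nil
	}
	return sem.Acquire(ctx, 1) // returns ctx.Err() when cancelled mid-wait
}

// Release gives the slot back; also a no-op when no semaphore is on ctx.
func Release(ctx context.Context) {
	if sem := SharedSemaphore(ctx); sem != nil {
		sem.Release(1)
	}
}
```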

Mechanics:

- New pkg/concurrency package: WithSharedSemaphore / SharedSemaphore /
  Acquire / Release helpers that read the semaphore from ctx, plus
  MaxParallelismFromEnv reading CUDLY_MAX_PARALLELISM (default 20).
  Acquire/Release are no-ops when no semaphore is on ctx — CLI tools
  and unit tests skip the semaphore entirely without per-call branching.

- Scheduler.CollectRecommendations allocates the semaphore (size from
  env) and attaches to ctx via concurrency.WithSharedSemaphore BEFORE
  the provider fan-out, so the wrapped ctx flows through every nested
  level. Logs the effective cap on entry.

- Three leaf fan-outs wired:
    - providers/aws/recommendations/client.go::GetAllRecommendations —
      5 service goroutines (EC2 / RDS / ElastiCache / OpenSearch /
      Redshift)
    - providers/azure/recommendations.go::GetRecommendations — 5
      service goroutines (compute / database / cache / cosmosdb /
      advisor); Acquire/Release boilerplate extracted into a goService
      helper so the function stays under gocyclo's 10-branch gate.
    - providers/gcp/recommendations.go::collectRegion — 2 sub-fan-out
      goroutines (compute / cloudsql) per region

  Each leaf calls Acquire(gctx) at the top, defers Release(gctx). On
  Acquire failure (parent ctx cancelled mid-wait) the leaf's per-service
  err variable is set to ctx.Err() and the goroutine returns nil to the
  errgroup — preserves the documented error-isolation contract.
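A possible shape of the goService-style leaf wrapper described above, assuming
the pkg/concurrency helpers sketched earlier (illustrative, not the repo's
exact code):

```go
// goService launches one leaf goroutine: it acquires a shared-semaphore slot
// around the actual IO and records its error without cancelling siblings.
func goService(g *errgroup.Group, gctx context.Context,
	fetch func(context.Context) error, errOut *error) {

	g.Go(func() error {
		if err := concurrency.Acquire(gctx); err != nil {
			*errOut = err // Acquire failed: parent ctx cancelled mid-wait
			return nil    // preserve error isolation, never cancel siblings
		}
		defer concurrency.Release(gctx)
		*errOut = fetch(gctx)
		return nil
	})
}
```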

Test mocks updated: persistCollection's UpsertRecommendations and the
notification email's SendNewRecommendationsNotification both run inside
CollectRecommendations after the ctx is wrapped, so their mock setups
were pinning the original (unwrapped) ctx. mock.Anything is the
appropriate looseness — the wrapped ctx is implementation detail.

New pkg/concurrency tests pin:
  - MaxParallelismFromEnv env-knob parser (unset / positive / zero /
    negative / non-numeric / explicit-unset)
  - Acquire/Release are no-ops when no semaphore on ctx
  - WithSharedSemaphore returns ctx unchanged when sem is nil
  - 20 goroutines under cap=3 never see >3 in-flight (load-bearing)
  - Acquire returns ctx.Err() when cancelled mid-wait

This is a smoke-test config — the user's dev deploy hit OOM with the
existing per-level caps; CUDLY_MAX_PARALLELISM=20 (paired with a Lambda
memory bump to 2 GB in a follow-up commit) should give the dev refresh
headroom to complete. Operators can dial up further via env override.

* infra(aws/dev): bump Lambda memory 512 MB → 2 GB for collect headroom

The dev Lambda hit Max Memory Used = 512 MB and was killed by the
runtime at ~28s (Duration 28701ms / Timeout 60s) on a user-triggered
recommendations refresh. The collection fan-out (post-#266 #267 #268)
keeps many concurrent gRPC/HTTP clients in flight — 30+ GCP regions ×
2 services each, plus 5 AWS services and 5 Azure services — and 512 MB
is too tight for that working-set even with the new shared-semaphore
cap.

This is a temporary headroom bump scoped to dev so the user can verify
the post-#266/#267/#268/concurrency-semaphore stack actually completes
end-to-end on a deployed Lambda. The shared semaphore (preceding
commit) is the durable fix for peak concurrency; this just gives the
function room to breathe while we observe the new behaviour live.

Only github-dev.tfvars is touched. github-staging.tfvars and
github-prod.tfvars stay at 1024 MB — they don't exercise every
provider with mostly-broken credentials the way dev does, and bumping
prod without first observing dev would be premature.

dev.tfvars (the user's local-apply config, gitignored) is also bumped
locally to keep parity with the CI deploy until this lands.

* fix(concurrency,aws): address PR #270 CodeRabbit review

Two CodeRabbit findings on PR #270:

1. **`pkg/concurrency/concurrency_test.go` (CRITICAL)**:
   `require.NoError` was called from inside the worker goroutines in
   `TestSharedSemaphore_BoundsConcurrency`. Testify's documented contract
   is that `require.*` / `FailNow`-style helpers must run on the test's
   own goroutine — calling them from a worker uses `runtime.Goexit` on
   the worker instead of safely stopping the test, which can hang or
   skip cleanup and produces race-detector noise.

   Workers now capture their Acquire result on a buffered `errCh` and
   the main goroutine drains the channel and asserts after `wg.Wait()`.
   `Release` is deferred only on a successful Acquire, matching the
   documented pairing contract.

2. **`providers/aws/recommendations/client.go` (MAJOR)**:
   The per-service `g.Go` goroutines in `GetAllRecommendations` were
   calling `concurrency.Acquire`/`Release` around the whole
   `GetRecommendationsForService` sweep — but that helper internally
   loops 2 × 3 (term, payment) variants and waits on the rate-limiter
   inside each retry, so a throttled service would hold a global slot
   for seconds while no Cost Explorer request was actually in flight.
   Effective parallelism dropped well below `CUDLY_MAX_PARALLELISM`.

   The Acquire/Release boundary moves down to each individual
   `costExplorerClient.GetReservationPurchaseRecommendation` call (and
   the SP equivalent in `parser_sp.go`). The rate-limiter's
   `Wait` happens *outside* the permit, so a goroutine waiting on
   exponential backoff doesn't tie up a slot. Each retry
   acquires a fresh permit, so throughput now scales with the cap
   regardless of how many services are individually throttled.
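An illustrative sketch of the narrowed boundary from finding 2; the Client
fields, the limiter interface and the helper name are assumptions, and only
the Wait-outside / Acquire-around-the-call ordering is the point.

```go
import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/costexplorer"
)

type Client struct {
	costExplorerClient *costexplorer.Client
	rateLimiter        interface{ Wait(context.Context) error } // stand-in for the real limiter
}

func (c *Client) getReservationRecommendation(
	ctx context.Context,
	input *costexplorer.GetReservationPurchaseRecommendationInput,
) (*costexplorer.GetReservationPurchaseRecommendationOutput, error) {
	// Rate-limiter / backoff waits happen outside the permit, so a throttled
	// retry never occupies a global concurrency slot while it sleeps.
	if err := c.rateLimiter.Wait(ctx); err != nil {
		return nil, err
	}

	// Each attempt takes a fresh permit only for the duration of the actual
	// Cost Explorer request.
	if err := concurrency.Acquire(ctx); err != nil {
		return nil, err
	}
	defer concurrency.Release(ctx)

	return c.costExplorerClient.GetReservationPurchaseRecommendation(ctx, input)
}
```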
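And a sketch of the errCh pattern from finding 1, shown against a plain
x/sync semaphore rather than the real pkg/concurrency helpers and bounds
assertions:

```go
package concurrency_test

import (
	"context"
	"sync"
	"sync/atomic"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
	"golang.org/x/sync/semaphore"
)

func TestSemaphore_BoundsConcurrency(t *testing.T) {
	sem := semaphore.NewWeighted(3)
	ctx := context.Background()

	const workers = 20
	errCh := make(chan error, workers) // buffered: workers never block on send
	var inFlight, maxSeen atomic.Int64
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			err := sem.Acquire(ctx, 1)
			errCh <- err // never call require.* from a worker goroutine
			if err != nil {
				return
			}
			defer sem.Release(1) // Release pairs only a successful Acquire

			// Track the peak number of workers holding a permit.
			cur := inFlight.Add(1)
			for {
				m := maxSeen.Load()
				if cur <= m || maxSeen.CompareAndSwap(m, cur) {
					break
				}
			}
			time.Sleep(10 * time.Millisecond) // hold the slot briefly
			inFlight.Add(-1)
		}()
	}
	wg.Wait()
	close(errCh)

	// Assertions run on the test's own goroutine, after all workers are done.
	for err := range errCh {
		require.NoError(t, err)
	}
	require.LessOrEqual(t, maxSeen.Load(), int64(3))
}
```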

Tests:
- `go test ./recommendations/...` from `providers/aws/` — pass
- `go test -race ./concurrency/...` from `pkg/` — race-clean
- Existing `go test -race ./recommendations/...` reveals a pre-existing
  data race on `Client.rateLimiter` (concurrent `Reset` / state
  mutation across the per-service goroutines), introduced by the
  parallelisation in #269 (`defac6c2a`). Filed as #271 — out of scope
  for this CR-fix commit, fix is straightforward (per-goroutine rate-
  limiter or internal mutex).

Other CodeRabbit checks (Azure / GCP per-service goroutines holding
permits across SDK calls) are NOT applied here because:
- Azure's per-service SDK calls don't have an explicit rate-limiter
  Wait loop equivalent to AWS's — the inner ARM/pricing calls block
  on network IO, not exponential backoff.
- GCP's `collectRegion` sub-fan-out goroutines do similarly direct SDK
  calls without an explicit Wait loop.
- CR didn't flag those layers; if a future review surfaces the same
  argument with concrete throughput evidence, we can mirror the AWS
  pattern there.