
feat(api): async recommendation refresh via scheduler Lambda invoke (closes #257) #260

Merged
cristim merged 3 commits into feat/multicloud-web-frontend from fix/issue-257-async-refresh on May 4, 2026

Conversation

cristim (Member) commented May 4, 2026

Summary

Async recommendation refresh end-to-end (closes #257). The 502 on user-triggered refresh and on first-load was caused by the API Lambda doing the full provider fan-out synchronously and exceeding its 60s timeout — Azure alone takes >60s (#258 is the orthogonal parallelization quick-win).

This PR splits the request path so the user-facing API returns immediately:

POST /api/recommendations/refresh:
  1. requirePermission(view, recommendations)
  2. read freshness for the response body
  3. atomically MarkCollectionStarted
       (returns false → 409 if another collection is in flight, with a
        5-min auto-recovery window in case the scheduler crashed mid-run)
  4. self-invoke the API Lambda via lambda:InvokeFunction with
     InvocationType=Event, hitting the EventBridge "scheduled" path
     (CUDly has one Lambda; the scheduler is just the cron-task code
     path inside it)
  5. return 202 with started_at + previous last_collected_at

The same async-invoke path is wired into the cold-start branch of GET /api/recommendations so first-load no longer blocks on a 60s provider fan-out.

Bookkeeping additions

StoreInterface (Postgres + 5 test mocks across analytics/api/purchase/scheduler/server packages):

MarkCollectionStarted(ctx) (bool, error)   // race-safe set; 5-min recovery
ClearCollectionStarted(ctx) error          // called by scheduler on exit

Migration 000047 adds last_collection_started_at on recommendations_state (default NULL).

Folded-in follow-ups (originally listed as out of scope)

After landing the backend, I folded in the deploy + UX pieces in a second commit so the PR is end-to-end functional and verifiable in deployed mode:

Terraform (terraform/modules/compute/aws/lambda/main.tf):

  • aws_iam_role_policy.async_self_invoke — grants lambda:InvokeFunction scoped to the API Lambda's own ARN
  • SCHEDULER_LAMBDA_ARN env var derived from aws_partition + aws_caller_identity + var.region + var.stack_name (avoids self-reference cycle on aws_lambda_function.main.arn)
  • Data sources aws_partition.current + aws_caller_identity.current

Frontend:

  • RefreshRecommendationsResult and RecommendationsFreshness types updated for the new payload shapes (legacy fields kept optional during transition so existing fixtures still type-check)
  • freshness.ts refresh-click handler now polls /recommendations/freshness every 5 s after the POST returns, waiting for last_collection_started_at to clear (or for last_collected_at to advance past started_at). 10-min safety cap.

Test plan

  • go vet ./... clean
  • go test ./internal/... — 3,614 tests pass
  • npm test (frontend) — 1,453 tests pass
  • tsc --noEmit clean
  • Pre-commit hooks (gocyclo, gosec, trivy, secrets scan, gofmt, govet, terraform fmt+validate+tflint, frontend build+tests) all green
  • CI green on push
  • Deployed verification: bump the RDS Proxy / Lambda timeout if Azure still takes > 60s (covered by #258, perf(azure): parallelize the 5 sequential service calls in GetRecommendations)

Closes #257.

feat(api): async recommendation refresh via scheduler Lambda invoke (closes #257)

The 502 on user-triggered refresh and on first-load (cold-start) was
caused by the API Lambda doing the full provider-fan-out
synchronously and exceeding its 60s timeout — Azure alone takes
>60s. The collection completes successfully in the scheduler Lambda
(longer timeout), but the API Lambda was the wrong place to do the
work.

This change splits the request path so the user-facing API returns
immediately:

  POST /api/recommendations/refresh:
   1. requirePermission(view, recommendations)
   2. read freshness for the response body
   3. atomically MarkCollectionStarted (returns false → 409 if another
      collection is already in flight, with a 5-minute auto-recovery
      window in case the scheduler crashed mid-run and never cleared)
   4. async-invoke the scheduler Lambda via lambda:InvokeFunction with
      InvocationType=Event (fire-and-forget) when SCHEDULER_LAMBDA_ARN
      is set; otherwise fall back to synchronous CollectRecommendations
      so local/HTTP-mode dev still works
   5. return 202 with started_at + previous last_collected_at so the
      frontend can render "Last refreshed N min ago — collecting…"

The same async-invoke path is wired into the cold-start branch of
GET /api/recommendations so first-load no longer blocks on a 60s
provider fan-out.

Bookkeeping additions in StoreInterface (Postgres + 5 test mocks
across analytics/api/purchase/scheduler/server packages):

  MarkCollectionStarted(ctx) (bool, error)   // race-safe set
  ClearCollectionStarted(ctx) error          // called by scheduler at end

Migration 000047 adds the last_collection_started_at column on
recommendations_state (default NULL).

Out of scope for this PR (filed as separate triaged follow-ups when
needed): the API Lambda's IAM grant for lambda:InvokeFunction on the
scheduler ARN (Terraform), and the frontend refresh button + polling
banner. The backend handler degrades to synchronous-collect when
SCHEDULER_LAMBDA_ARN is unset, so deployed verification is gated on
the IAM + env-var follow-up.
@cristim cristim added labels on May 4, 2026: priority/p2 (Backlog-worthy), severity/medium (Moderate harm), urgency/this-quarter (Within the quarter), impact/many (Affects most users), effort/m (Days), type/bug (Defect), triaged (Item has been triaged)
coderabbitai Bot commented May 4, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f2e20da7-9472-49bb-bcfa-161cb4d009e4

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This PR enables asynchronous recommendation collection to prevent API Lambda timeouts on cold-start. It adds collection-state tracking via a new last_collection_started_at timestamp, implements a Lambda self-invocation mechanism in a new POST handler, persists in-flight state to the database, and returns HTTP 202 to clients when async collection begins, allowing polling for completion.

Changes

Async Recommendation Collection State Machine

Changes by layer:

  • Data shape (internal/config/types.go): RecommendationsFreshness adds a LastCollectionStartedAt *time.Time field to track when async collection begins.
  • Database schema (internal/database/postgres/migrations/000047_*): migration adds a nullable last_collection_started_at TIMESTAMPTZ column to the recommendations_state table.
  • Store interface & implementation (internal/config/interfaces.go, internal/config/store_postgres_recommendations.go): StoreInterface adds MarkCollectionStarted(ctx) (bool, error) (atomic "won race" check) and ClearCollectionStarted(ctx) error. PostgreSQL implementation: MarkCollectionStarted sets the timestamp only if NULL or older than 5 minutes; ClearCollectionStarted nullifies it. GetRecommendationsFreshness and SetRecommendationsCollectionError now read/write last_collection_started_at.
  • Handler & Lambda invocation (internal/api/handler.go, internal/api/handler_recommendations_refresh.go): Handler gains a lambdaInvoker LambdaInvokerInterface field. The new postRefreshRecommendations handler enforces permission, reads freshness, atomically marks collection start (409 on conflict), then invokes the scheduler asynchronously via SCHEDULER_LAMBDA_ARN or falls back to sync CollectRecommendations. Returns HTTP 202 with a RefreshResponse containing started_at and last_collected_at. Supporting asyncInvokeSelf, getLambdaInvoker, and triggerColdStartCollect helpers provide the async-invoke abstraction and test injection.
  • Router wiring (internal/api/router.go): refreshRecommendationsHandler now dispatches to postRefreshRecommendations instead of directly calling scheduler.CollectRecommendations.
  • Scheduler cleanup (internal/scheduler/scheduler.go): CollectRecommendations now defers clearCollectionStartedBestEffort to guarantee last_collection_started_at is cleared on completion (success or failure), allowing frontend polling to reliably detect end-of-collection.
  • Test mocks & assertions (internal/api/mocks_test.go, internal/api/handler_test.go, internal/analytics/collector_test.go, internal/purchase/mocks_test.go, internal/scheduler/scheduler_test.go, internal/scheduler/scheduler_overrides_test.go, internal/scheduler/scheduler_suppressions_test.go, internal/server/test_helpers_test.go): all mock ConfigStore implementations extended with MarkCollectionStarted and ClearCollectionStarted methods (default: return true, nil and nil). TestHandler_HandleRequest_RefreshRecommendations updated to mock freshness reads and collection-start marking, and to accept 200 or 202 status.

Sequence Diagram

sequenceDiagram
    participant Client
    participant APIHandler as API Handler
    participant ConfigStore as Config Store
    participant LambdaInvoker as Lambda Invoker
    participant SchedulerLambda as Scheduler Lambda
    
    rect rgba(200, 150, 255, 0.5)
    Note over Client,SchedulerLambda: Async Recommendation Collection (POST /api/recommendations/refresh)
    Client->>APIHandler: POST /api/recommendations/refresh
    APIHandler->>ConfigStore: MarkCollectionStarted()
    alt Collection already in-flight
        ConfigStore-->>APIHandler: false (409 Conflict)
        APIHandler-->>Client: 409 Collection in progress
    else First to acquire lock
        ConfigStore-->>APIHandler: true, nil
        APIHandler->>ConfigStore: GetRecommendationsFreshness()
        ConfigStore-->>APIHandler: freshness
        alt SCHEDULER_LAMBDA_ARN configured
            APIHandler->>LambdaInvoker: asyncInvokeSelf(SCHEDULER_LAMBDA_ARN)
            LambdaInvoker->>SchedulerLambda: Invoke(InvocationTypeEvent, event="scheduled_recommendations")
            SchedulerLambda-->>LambdaInvoker: OK (async, no wait)
            alt Invoke succeeded
                APIHandler-->>Client: 202 {started_at, last_collected_at}
                SchedulerLambda->>ConfigStore: CollectRecommendations() + defer ClearCollectionStarted()
                ConfigStore-->>SchedulerLambda: done
            else Invoke failed
                APIHandler->>ConfigStore: ClearCollectionStarted() [rollback]
                ConfigStore-->>APIHandler: cleared
                APIHandler-->>Client: 500/502 error
            end
        else Fallback: no ARN, sync collect
            APIHandler->>ConfigStore: CollectRecommendations()
            ConfigStore-->>APIHandler: collected
            APIHandler-->>Client: 202 {started_at, last_collected_at}
        end
    end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested labels

priority/p1, severity/high, urgency/now, type/feat

Poem

A Lambda that timed out at sixty,
Now async, returning quite nifty—
Mark start, invoke free,
Poll 'til collection's decree,
Cold recommendations, no longer iffy! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: title accurately captures the main feature, async recommendation refresh via scheduler Lambda invocation.
  • Linked Issues check ✅ Passed: PR implements async collection via Lambda invoke and MarkCollectionStarted coordination, addressing #257's timeout and cold-start collection blocking issues.
  • Out of Scope Changes check ✅ Passed: all changes are scoped to the async refresh implementation (handler, config store methods, database migration, mock updates). No unrelated modifications present.
  • Description check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.


cristim (Member, Author) commented May 4, 2026

@coderabbitai full review

coderabbitai Bot commented May 4, 2026

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
internal/api/handler_test.go (1)

498-528: ⚡ Quick win

Make this refresh-path test deterministic instead of accepting both 200 and 202.

The broad status assertion can mask regressions. Pin the branch via env setup and assert one exact outcome.

💡 Suggested test hardening
 func TestHandler_HandleRequest_RefreshRecommendations(t *testing.T) {
 	ctx := context.Background()
+	t.Setenv("SCHEDULER_LAMBDA_ARN", "") // force sync fallback path for this test
 	mockScheduler := new(MockScheduler)
 	mockAuth := new(MockAuthService)
 	mockStore := new(MockConfigStore)
@@
-	// 202 in async (Lambda) mode; handler degrades to a synchronous collect
-	// when SCHEDULER_LAMBDA_ARN is unset (the test default), which still
-	// returns 200 because the response is fully populated by the time we reply.
-	assert.Contains(t, []int{200, 202}, resp.StatusCode)
+	assert.Equal(t, 200, resp.StatusCode)
+	mockScheduler.AssertCalled(t, "CollectRecommendations", mock.Anything)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/api/handler_test.go` around lines 498 - 528, The test currently
accepts either 200 or 202 which hides branch regressions; make it deterministic
by explicitly setting the SCHEDULER_LAMBDA_ARN environment before creating the
Handler so the code follows the desired path, then assert the single expected
status code. For example, to pin the synchronous path, ensure
SCHEDULER_LAMBDA_ARN is unset (or set to an empty string) prior to calling
Handler.HandleRequest and replace the assert.Contains on resp.StatusCode with an
assert.Equal for 200; to pin the asynchronous path set SCHEDULER_LAMBDA_ARN to a
non-empty value and assert 202. Reference Handler.HandleRequest and the
mockScheduler/mockStore setup when applying the env change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/api/handler_recommendations_refresh.go`:
- Around line 172-186: MarkCollectionStarted currently discards the boolean "ok"
which indicates if this caller won the race; capture its return (ok, err) from
h.config.MarkCollectionStarted(ctx) and if ok == false simply return
h.config.GetRecommendationsFreshness(ctx) without calling h.asyncInvokeSelf or
ClearCollectionStarted. Only if ok == true should you call
h.asyncInvokeSelf(ctx, schedulerARN); if that invoke fails then call
h.config.ClearCollectionStarted(ctx) to roll back and return the error. This
ensures you don't redundantly invoke asyncInvokeSelf or accidentally clear
another caller's in-flight marker; refer to MarkCollectionStarted,
asyncInvokeSelf, ClearCollectionStarted, and GetRecommendationsFreshness to
locate the changes.
- Around line 102-114: The async branch currently uses a stale freshness
snapshot read before calling MarkCollectionStarted, so LastCollectionStartedAt
is nil and startedAt falls back to now; after a successful async invoke you
should re-fetch freshness (same way the sync path does) and use
freshness.LastCollectionStartedAt if non-nil to set StartedAt in the
RefreshResponse (update the code around the async invoke to call the freshness
read function again and mirror the sync logic that assigns startedAt, preserving
use of MarkCollectionStarted and returning the refreshed LastCollectedAt and
StartedAt values).

In `@internal/api/router.go`:
- Around line 341-342: The route for /api/recommendations/refresh is still
protected by AuthAdmin at registration, so non-admins are blocked before
reaching Router.refreshRecommendationsHandler which calls
h.postRefreshRecommendations (and performs requirePermission(view,
recommendations)); update the route registration to use a non-admin auth policy
(e.g., AuthUser or the equivalent) or remove the Admin-only guard so the
handler's permission check can run, ensuring the registration change references
the same route path and uses the AuthUser/AuthNone constant instead of
AuthAdmin.

In `@internal/scheduler/scheduler.go`:
- Around line 153-157: Deferred cleanup currently captures and reuses the
incoming request context (ctx) in the defer call to
s.clearCollectionStartedBestEffort, which can be canceled before defer runs and
prevents clearing last_collection_started_at; change the defer to capture a
non-cancelable context (e.g. context.Background() or a created context with its
own timeout) and pass that into s.clearCollectionStartedBestEffort so the
cleanup will run even if the original ctx is canceled — update the defer
statement where s.clearCollectionStartedBestEffort is registered (and any
similar defer sites around that block) to use the new background/independent
context instead of ctx.
- Around line 143-149: Current unconditional clearing of
last_collection_started_at can erase a concurrent run’s marker; change to an
ownership-aware clear by having MarkCollectionStarted return an owner identifier
(e.g., a UUID token or the started_at timestamp) and then perform a
compare-and-clear when exiting: only clear last_collection_started_at if the
stored value matches the returned owner token/started_at. Update the scheduler
exit/cleanup paths that currently clear last_collection_started_at (the
unconditional clear code around the exit logic referenced plus the similar spots
at the other mentioned locations) to use this compare-and-clear check so
overlapping cron and async runs don’t wipe each other’s markers.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bab35008-6b17-4228-bbf9-3d56361a8b13

📥 Commits

Reviewing files that changed from the base of the PR and between c8c3ba9 and e0f647a.

📒 Files selected for processing (17)
  • internal/analytics/collector_test.go
  • internal/api/handler.go
  • internal/api/handler_recommendations_refresh.go
  • internal/api/handler_test.go
  • internal/api/mocks_test.go
  • internal/api/router.go
  • internal/config/interfaces.go
  • internal/config/store_postgres_recommendations.go
  • internal/config/types.go
  • internal/database/postgres/migrations/000047_recommendations_state_started_at.down.sql
  • internal/database/postgres/migrations/000047_recommendations_state_started_at.up.sql
  • internal/purchase/mocks_test.go
  • internal/scheduler/scheduler.go
  • internal/scheduler/scheduler_overrides_test.go
  • internal/scheduler/scheduler_suppressions_test.go
  • internal/scheduler/scheduler_test.go
  • internal/server/test_helpers_test.go

Comment on lines +102 to +114
// Re-read started_at (set by MarkCollectionStarted) for the 202 body.
// In the synchronous path this will be NULL (cleared by the defer), so we
// fall back to now as a reasonable sentinel.
now := time.Now().UTC()
startedAt := now
if freshness.LastCollectionStartedAt != nil {
	startedAt = *freshness.LastCollectionStartedAt
}

return &RefreshResponse{
	StartedAt:       startedAt,
	LastCollectedAt: freshness.LastCollectedAt,
}, nil

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Stale freshness in async path yields incorrect started_at.

In the async branch, freshness was read at line 62 before MarkCollectionStarted was called. Consequently, freshness.LastCollectionStartedAt will be nil (or stale), and startedAt always falls back to the local now sentinel rather than the actual database timestamp.

The sync path correctly re-reads freshness (lines 96-99) after collection. The async path should do the same after a successful invoke.

Proposed fix: re-read freshness after async invoke
 	} else {
 		// Non-Lambda (HTTP) mode: collect synchronously. ClearCollectionStarted
 		// is called inside CollectRecommendations via the defer, so we don't
 		// need to call it here.
 		if _, collectErr := h.scheduler.CollectRecommendations(ctx); collectErr != nil {
 			return nil, fmt.Errorf("collection failed: %w", collectErr)
 		}
+	}
+
+	// Re-read freshness to capture the actual started_at / collected_at values.
+	freshness, err = h.config.GetRecommendationsFreshness(ctx)
+	if err != nil {
+		return nil, fmt.Errorf("failed to read freshness: %w", err)
+	}
-		// Re-read freshness after synchronous collect so the response reflects
-		// the actual collection time.
-		freshness, err = h.config.GetRecommendationsFreshness(ctx)
-		if err != nil {
-			return nil, fmt.Errorf("failed to read freshness after collect: %w", err)
-		}
-	}

 	// Re-read started_at (set by MarkCollectionStarted) for the 202 body.
-	// In the synchronous path this will be NULL (cleared by the defer), so we
-	// fall back to now as a reasonable sentinel.
+	// In the synchronous path this may be NULL (cleared by defer), so fall back
+	// to now as a reasonable sentinel.
 	now := time.Now().UTC()

Comment on lines +172 to +186
if schedulerARN != "" {
	// Atomic mark. If another request already marked it, that's fine —
	// we just return the current freshness.
	_, err := h.config.MarkCollectionStarted(ctx)
	if err != nil {
		return nil, fmt.Errorf("failed to mark cold-start collection: %w", err)
	}
	if invokeErr := h.asyncInvokeSelf(ctx, schedulerARN); invokeErr != nil {
		// Best-effort: undo the mark so the next request can try again.
		_ = h.config.ClearCollectionStarted(ctx)
		return nil, fmt.Errorf("failed to trigger cold-start collect: %w", invokeErr)
	}
	// Re-read freshness to return the started_at value.
	return h.config.GetRecommendationsFreshness(ctx)
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Discarded ok from MarkCollectionStarted causes redundant invokes and unsafe rollback.

The boolean return indicating whether this caller won the race is discarded. When ok == false (another collection is already in-flight):

  1. asyncInvokeSelf is called anyway — wasteful/duplicate Lambda invocation.
  2. If the invoke fails, ClearCollectionStarted clears the other caller's in-flight marker, potentially allowing premature re-triggering.

The comment at line 173-174 acknowledges the scenario but the code doesn't act on it.

Proposed fix: respect the race result
 	if schedulerARN != "" {
 		// Atomic mark. If another request already marked it, that's fine —
 		// we just return the current freshness.
-		_, err := h.config.MarkCollectionStarted(ctx)
+		ok, err := h.config.MarkCollectionStarted(ctx)
 		if err != nil {
 			return nil, fmt.Errorf("failed to mark cold-start collection: %w", err)
 		}
-		if invokeErr := h.asyncInvokeSelf(ctx, schedulerARN); invokeErr != nil {
-			// Best-effort: undo the mark so the next request can try again.
-			_ = h.config.ClearCollectionStarted(ctx)
-			return nil, fmt.Errorf("failed to trigger cold-start collect: %w", invokeErr)
+		if ok {
+			if invokeErr := h.asyncInvokeSelf(ctx, schedulerARN); invokeErr != nil {
+				// Best-effort: undo the mark so the next request can try again.
+				_ = h.config.ClearCollectionStarted(ctx)
+				return nil, fmt.Errorf("failed to trigger cold-start collect: %w", invokeErr)
+			}
 		}
 		// Re-read freshness to return the started_at value.
 		return h.config.GetRecommendationsFreshness(ctx)
 	}

Comment thread internal/api/router.go
Comment on lines 341 to +342
 func (r *Router) refreshRecommendationsHandler(ctx context.Context, req *events.LambdaFunctionURLRequest, params map[string]string) (any, error) {
-	return r.h.scheduler.CollectRecommendations(ctx)
+	return r.h.postRefreshRecommendations(ctx, req)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Refresh handler is permission-aware, but route-level auth still blocks non-admin users.

At Line 342, refresh now routes through postRefreshRecommendations, which performs requirePermission(view, recommendations). However, /api/recommendations/refresh is still registered with default AuthAdmin (Line 92), so non-admin users never reach this handler.

💡 Proposed fix
- {ExactPath: "/api/recommendations/refresh", Method: "POST", Handler: r.refreshRecommendationsHandler},
+ {ExactPath: "/api/recommendations/refresh", Method: "POST", Handler: r.refreshRecommendationsHandler, Auth: AuthUser},

Comment on lines +143 to +149
// Bookkeeping: always clears last_collection_started_at on exit (success or
// failure) so the frontend polling loop can detect completion. The scheduler
// is invoked either by the cron EventBridge rule or by an async self-invoke
// from the POST /api/recommendations/refresh handler. In the async case,
// MarkCollectionStarted has already set last_collection_started_at; the
// cron case leaves it NULL (no async-invoke bookkeeping for cron runs, which
// are expected and not user-triggered).

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Unconditional clear can erase another in-flight run’s marker.

Because every scheduler run clears last_collection_started_at on exit, an overlapping cron invocation can clear the marker for a user-triggered async run before that run completes. That weakens the 409 guard and can end frontend polling early.

A robust fix is to move from plain timestamp clearing to ownership-aware clear (e.g., invocation token or compare-and-clear on expected started_at value).

Also applies to: 153-157, 254-257


Comment on lines +153 to +157
// Always clear last_collection_started_at on exit so the frontend knows
// the collection has finished. Extracted into a helper to keep this
// function under the cyclomatic-complexity gate.
defer s.clearCollectionStartedBestEffort(ctx)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Deferred marker clear should not reuse the request context.

At Line 156, the deferred cleanup uses ctx. If collection ends due to a deadline or cancellation, Line 255 executes with a canceled context and fails immediately, leaving last_collection_started_at stuck until TTL recovery.

💡 Proposed fix
-	defer s.clearCollectionStartedBestEffort(ctx)
+	defer s.clearCollectionStartedBestEffort()

-func (s *Scheduler) clearCollectionStartedBestEffort(ctx context.Context) {
-	if err := s.config.ClearCollectionStarted(ctx); err != nil {
+func (s *Scheduler) clearCollectionStartedBestEffort() {
+	clearCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+	defer cancel()
+	if err := s.config.ClearCollectionStarted(clearCtx); err != nil {
 		logging.Errorf("failed to clear collection started: %v", err)
 	}
 }

Also applies to: 254-257

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduler/scheduler.go` around lines 153 - 157: deferred cleanup
currently captures and reuses the incoming request context (ctx) in the defer
call to s.clearCollectionStartedBestEffort, which can be canceled before defer
runs and prevents clearing last_collection_started_at; change the defer to
capture a non-cancelable context (e.g. context.Background() or a created context
with its own timeout) and pass that into s.clearCollectionStartedBestEffort so
the cleanup will run even if the original ctx is canceled — update the defer
statement where s.clearCollectionStartedBestEffort is registered (and any
similar defer sites around that block) to use the new background/independent
context instead of ctx.

Folds the previously-out-of-scope follow-ups into PR #257 itself so
deployed verification doesn't require a second PR:

  Terraform (compute/aws/lambda):
   - new aws_iam_role_policy.async_self_invoke grants
     lambda:InvokeFunction scoped to the API Lambda's own ARN, so the
     async-refresh handler can fire-and-forget self-invoke the
     scheduler-task code path with InvocationType=Event
   - new SCHEDULER_LAMBDA_ARN env var derived from
     aws_partition + aws_caller_identity + var.region + var.stack_name
     so the variable is the API Lambda's own ARN without a self-
     reference cycle on aws_lambda_function.main.arn
   - data sources aws_partition.current and aws_caller_identity.current
     added at the top of the module

  Frontend (api/recommendations.ts + freshness.ts):
   - RefreshRecommendationsResult: started_at + last_collected_at
     (matches the new RefreshResponse from the backend handler).
     Pre-#257 fields (recommendations / total_savings / providers)
     kept as optional during the transition so legacy fixtures still
     type-check.
   - RecommendationsFreshness: new optional last_collection_started_at
     so polling can detect when the in-flight async collection has
     finished (the backend clears it on both success and failure paths).
   - freshness.ts refresh-button click handler now polls
     /recommendations/freshness every 5s after the POST returns,
     waiting for last_collection_started_at to clear (or for
     last_collected_at to advance past started_at). 10-min safety cap
     so a stuck collection never wedges the UI. Optional started_at
     in the response degrades gracefully to "synchronous, exit
     immediately" so pre-#257 mocks still pass.

Tests: 1,453 frontend tests pass; existing freshness test suite
unmodified and green.
@cristim
Member Author

cristim commented May 4, 2026

@coderabbitai full review

I folded the originally-out-of-scope follow-ups into this PR:

  • Terraform IAM grant + env var
  • Frontend type updates + 5s polling loop after the POST returns

3,614 Go tests + 1,453 frontend tests + all pre-commit hooks (terraform fmt/validate/tflint, frontend build, etc.) green locally.

@coderabbitai

coderabbitai Bot commented May 4, 2026

@cristim Thanks for the update! I'll kick off a fresh full review covering all the folded-in additions:

  • Terraform: IAM grant for API Lambda → scheduler Lambda invoke + SCHEDULER_LAMBDA_ARN env var wiring
  • Frontend: type updates for the new RefreshResponse shape + the 5-second polling loop after the POST /api/recommendations/refresh returns

Solid test coverage numbers — will review all the changes thoroughly.

[full_review]

…sh (#257)

Five actionable + one nitpick from CR review of e0f647a..170b18e:

  1. handler_recommendations_refresh.go::triggerColdStartCollect now
     respects the bool from MarkCollectionStarted: ok=false means
     another caller already won the race; we return current freshness
     without a second async-invoke and without ClearCollectionStarted
     (which would wipe the other caller's marker).

  2. postRefreshRecommendations re-reads freshness AFTER the async
     invoke so LastCollectionStartedAt reflects the value
     MarkCollectionStarted just wrote. Previously the freshness
     snapshot was taken before the mark, so StartedAt always fell
     back to time.Now() instead of the persisted value. The collect-
     trigger logic was extracted into runMarkedCollection so the
     parent stays under the cyclomatic-complexity gate.

  3. router.go: /api/recommendations/refresh now uses Auth: AuthUser
     (was the AuthAdmin default). The handler still enforces
     requirePermission(view, recommendations) so view-only roles can
     refresh data they're allowed to see; admin-only would have
     blocked them at the router layer.

  4. scheduler.go: defer-clear of last_collection_started_at now uses
     a fresh context.Background() with a 5s timeout. The request ctx
     could already be canceled by the time the defer runs (slow
     collect outliving caller deadline), which would leave the marker
     in place until the 5-min auto-recovery window kicked in.

  5. handler_test.go: TestHandler_HandleRequest_RefreshRecommendations
     now pins the synchronous-fallback branch via t.Setenv() and
     asserts a single 200 (was assert.Contains([200, 202])). Async-
     branch coverage moves to handler-level integration tests that
     inject lambdaInvoker directly.

CR finding #6 (compare-and-clear with owner token to avoid wiping a
concurrent run's marker) deliberately deferred — the existing 5-min
auto-recovery window already handles overlapping runs and the
owner-token refactor is a larger architectural change. Filed as
follow-up in the PR thread.
@cristim
Member Author

cristim commented May 4, 2026

CodeRabbit pass-1 response

Fixed (pushed in 49bc90d):

  1. triggerColdStartCollect now respects MarkCollectionStarted's ok bool (handler_recommendations_refresh.go:172-186). When another caller already won the race, return current freshness without a second async-invoke and without ClearCollectionStarted.
  2. Async branch re-reads freshness AFTER MarkCollectionStarted (handler_recommendations_refresh.go:102-114) so LastCollectionStartedAt reflects the persisted value. Trigger logic extracted into runMarkedCollection to keep the parent under the cyclomatic-complexity gate.
  3. /api/recommendations/refresh now uses Auth: AuthUser (router.go:341-342); the handler still enforces requirePermission(view, recommendations) so view-only roles can refresh data they're allowed to see.
  4. Defer-clear uses fresh context.Background() with 5s timeout (scheduler.go:153-157) so the marker is cleared even if the request ctx was canceled mid-collect.
  5. TestHandler_HandleRequest_RefreshRecommendations pins the sync-fallback branch via t.Setenv() and asserts a single 200 (was assert.Contains([200, 202])).

Dismissed with justification:

  • scheduler.go:143-149 compare-and-clear with owner token (priority/major): the existing 5-min auto-recovery window already handles overlapping cron + async runs. Owner-token bookkeeping (UUID per run, compare-and-clear on exit) is a larger architectural change — adding an owner parameter to MarkCollectionStarted/ClearCollectionStarted, a column migration, and matching mock updates across 5 packages. The narrow fix-now value is small (the practical race window is <60s of overlap given cron cadence), so deferring to a follow-up issue rather than expanding this PR's scope.

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented May 4, 2026

@cristim Thanks for the detailed pass-1 response and the follow-through on the fixes. Here's a quick acknowledgement before the next pass:

Accepted fixes (49bc90d):

  1. ✅ triggerColdStartCollect — correctly short-circuits when ok == false from MarkCollectionStarted, avoiding a spurious second async-invoke and a mismatched ClearCollectionStarted.
  2. ✅ Freshness re-read post-MarkCollectionStarted: LastCollectionStartedAt in the 202 response now reflects the persisted DB timestamp. The runMarkedCollection extraction also helps with cyclomatic complexity.
  3. ✅ Router auth level (AuthUser) — appropriate since requirePermission is the real enforcement gate.
  4. ✅ Defer-clear with context.Background() + 5s timeout — correct defensive pattern; avoids dropped cleanup on a cancelled request context.
  5. ✅ Test pinning to sync-fallback path — deterministic assertion on 200 is cleaner than the Contains([200, 202]) hedge.

Dismissed (owner token / compare-and-clear): The 5-minute auto-recovery window + the narrow practical overlap window (<60 s given cron cadence) is a reasonable justification for deferring the UUID-based owner bookkeeping to a follow-up. If you'd like me to open a tracking issue for that architectural follow-up, just say the word.

Kicking off a fresh full review now.

[full_review]

@cristim cristim merged commit 7eb77d1 into feat/multicloud-web-frontend May 4, 2026
5 checks passed
cristim added a commit that referenced this pull request May 4, 2026
…up) (#265)

The async refresh path landed in PR #260 but the Lambda self-invoke
payload didn't match what the receiving Lambda expects. The shipped
payload was:

  {"event": "scheduled_recommendations"}

But internal/server/handler.go::ScheduledEvent has no "event" field —
it has Action, Source, DetailType, Detail. Unmarshalling that JSON
gives Action="", and ParseScheduledEvent rejects every event with:

  unknown scheduled task action: ""

Verified live in /aws/lambda/cudly-dev-426fc8af-api request
3befb380-6d46-47e1-954b-69a2b59ea90a:

  16:49:40 Received unknown event (size: 37 bytes)
  16:49:40 Unknown event type, treating as scheduled event
  16:49:40 {"errorMessage":"failed to parse scheduled event:
              unknown scheduled task action: \"\""}

The 37-byte event = the {"event":"scheduled_recommendations"} string
exactly. The async self-invoke is firing successfully but every
invocation is rejected, so the "Refreshing..." banner never clears
and collection never runs — same user-visible symptom (refresh times
out + no fresh data) as if the env var or IAM grant was missing.

Fix: send the payload shape the dispatcher actually accepts:

  {"source": "aws.events", "action": "collect_recommendations"}

action="collect_recommendations" maps to TaskCollectRecommendations
in the ParseScheduledEvent switch. source="aws.events" makes
detectLambdaEventType in internal/server/lambda.go classify the
invoke consistently with EventBridge cron deliveries that already
exercise this code path.
cristim added a commit that referenced this pull request May 4, 2026
 #267 #268) (#269)

* perf(aws): parallelize 5 sequential service calls in GetAllRecommendations (closes #266)

The dispatcher in providers/aws/recommendations/client.go::GetAllRecommendations
ran 5 service calls (EC2 / RDS / ElastiCache / OpenSearch / Redshift) one
after another with an artificial 100ms stagger between each. Each per-
service call hits AWS Cost Explorer's GetReservationPurchaseRecommendation,
which routinely takes multiple seconds. Total wall time per account scaled
as sum(per-service) instead of max(per-service).

Mirrors the Azure parallelisation in providers/azure/recommendations.go
(closes #258, commit b10326c). The five calls have no inter-service
dependency, so they can run concurrently under errgroup.WithContext. Each
goroutine captures its own error in a closure-scoped variable and returns
nil to the group so one service's failure does not cancel siblings
(matching the previous loop's `continue`-on-error tolerance).

Results are merged in the canonical order
EC2 → RDS → ElastiCache → OpenSearch → Redshift after Wait() so any
order-sensitive callers stay stable. After Wait(), ctx.Err() is checked
and propagated so a canceled parent ctx surfaces as context.Canceled /
context.DeadlineExceeded rather than being swallowed by the per-service
error-isolation goroutines (which all return nil).

The 100ms artificial stagger is removed — it was a sequential-mode rate-
limit hack that adds 400ms per account for no benefit under concurrent
fan-out (Cost Explorer rate limits apply per request, not per consecutive
request).

Behaviour change vs the previous sequential loop: per-service errors are
now logged at WARN via the new mergeServiceResults helper. The previous
loop swallowed them silently with a bare `continue`, leaving operators no
signal when a single service was misbehaving. The serviceResult struct +
mergeServiceResults helper extraction also keeps GetAllRecommendations
under the project's gocyclo gate (.golangci.yml min-complexity: 15).

New test TestGetAllRecommendations_PropagatesContextCancellation pins the
contract: calling GetAllRecommendations with a pre-cancelled context must
return context.Canceled. Mirrors the Azure equivalent at
providers/azure/recommendations_test.go::TestRecommendationsClientAdapter_GetRecommendations_PropagatesContextCancellation.

Expected end-to-end after this change: ~max(individual call durations)
instead of sum + 4×100ms stagger.

* perf(gcp): parallelize region × service nested loops in GetRecommendations (closes #267)

The dispatcher in providers/gcp/recommendations.go::GetRecommendations was
doubly serial — `for region { for service { ... } }` — fetching Compute
Engine then Cloud SQL recommendations one region at a time. With ~30+
GCP regions × 2 services, that's 60+ sequential RPCs per account against
the project-scoped Recommender API. Even at 200ms per call this is a
12s+ floor; in practice most calls are slower. After #260 (async self-
invoke) this no longer triggers user-facing 502s, but it directly
inflates the scheduler Lambda's runtime and the in-flight freshness-
banner window.

Two-level concurrent fan-out, mirroring the Azure pattern in
providers/azure/recommendations.go (closes #258, commit b10326c) and
the AWS service-loop parallelisation (closes #266):

  - Outer: errgroup over regions, capped at gcpRegionConcurrency()
    (CUDLY_GCP_REGION_PARALLELISM, default 10) to stay polite to the
    project-scoped Recommender API quota. Lower than the existing
    account-level CUDLY_MAX_ACCOUNT_PARALLELISM (default 20) because
    each account already gets its own goroutine slot from
    fanOutPerAccount; cumulative concurrency would multiply otherwise.
  - Inner: within each region's goroutine, Compute + Cloud SQL run as
    two further goroutines under a per-region sub-errgroup, so per-
    region cost is max(compute, sql) rather than compute + sql.

Each goroutine returns nil to its errgroup so a single per-(region,
service) failure does not cancel siblings — preserves the previous
silent-skip-on-err shape semantically.

Behaviour change vs the previous nested for-loops: per-(region,
service) errors that were silently swallowed by the previous
`if err == nil { ... }` shape are now logged at WARN with the
region+service tag so misconfigured projects are diagnosable. Errors
still don't propagate to callers (preserving silent-skip), only the
log severity changes from "invisible" to WARN.

After the outer Wait(), ctx.Err() is checked and propagated so a
canceled parent ctx surfaces as context.Canceled /
context.DeadlineExceeded rather than being swallowed by the per-region
error-isolation goroutines (which all return nil).

Result merging walks regions in sorted order and appends compute then
sql per region — output is deterministic independent of GCP API
region-list ordering or goroutine completion order.

The collectRegion helper extraction also keeps GetRecommendations
under the project's gocyclo gate (.golangci.yml min-complexity: 15)
after the post-Wait ctx.Err() block was added.

Tests added:
  - TestRecommendationsClientAdapter_GetRecommendations_PropagatesContextCancellation
    pins the contract that a pre-cancelled ctx surfaces as context.Canceled.
  - TestGCPRegionConcurrency pins the env-knob parser semantics
    (unset/positive/non-numeric/zero/negative/explicit-unset).

Expected end-to-end after this change:
ceil(N_regions / cap) × max(compute, sql) instead of
N_regions × (compute + sql) — for a project with 30 regions and a cap
of 10, that's a ~6× speedup before counting the per-region service-
level parallelism win.

* perf(scheduler): parallelize provider collection loop AWS/Azure/GCP (closes #268)

The provider loop in scheduler.go::CollectRecommendations ran AWS, Azure
and GCP collection sequentially: total wall time = sum(per-provider)
rather than max(per-provider). After #260 (async self-invoke) the user-
facing 502 on POST /api/recommendations/refresh is gone, but the
scheduler Lambda itself still pays the serial cost — visible as a
longer "collection in flight" freshness-banner window and reduced
cron-tick headroom.

Provider-level fan-out under errgroup, mirroring the Azure and AWS
service-loop parallelisations (#258 / #266). Each per-provider goroutine
returns nil to the group so a single provider's error does not cancel
its siblings — preserves the previous loop's `continue`-on-error
semantics. Per-provider results are written into a map under a single
mutex (the same pattern as fanOutPerAccount at scheduler.go:393).

After Wait, ctx.Err() is checked and propagated so a canceled parent
ctx surfaces as context.Canceled / context.DeadlineExceeded rather than
being swallowed by the per-provider error-isolation goroutines (which
all return nil).

Deterministic merge: walk EnabledProviders in config order to assemble
successfulProviders / successfulCollects / allRecommendations /
totalSavings / failedProviders. Order is by config, NOT by goroutine
completion order or by Go's randomised map iteration — keeps existing
tests stable and the freshness-banner / notification-email content
predictable.

The new collectAllProviders helper extraction also keeps
CollectRecommendations under the project's gocyclo gate
(.golangci.yml min-complexity: 15) after the errgroup + post-Wait
ctx.Err() block was added.

No concurrency cap on the outer fan-out — the universe is at most 3
providers; the per-account fan-out inside each provider is still
bounded by CUDLY_MAX_ACCOUNT_PARALLELISM (default 20) and per-region
GCP fan-out by CUDLY_GCP_REGION_PARALLELISM (default 10, from #267).

Test mocks updated: per-provider mock methods (CreateAndValidateProvider,
ListCloudAccounts, GetRecommendationsClient, GetAllRecommendations) are
now invoked from inside the errgroup goroutine, so they receive the
errgroup-derived gctx rather than the caller's ctx. Mock setups changed
from literal ctx to mock.Anything where the call sits inside the
goroutine path; calls made before/after the fan-out (GetGlobalConfig,
SendNewRecommendationsNotification, persistCollection writes) still
pin literal ctx because they run on the caller's context unchanged.

New TestScheduler_CollectRecommendations_ParallelProviders pins three
contracts:

  1. successfulProviders ordering matches EnabledProviders config order,
     not goroutine completion order or alphabetical. Asserted with a
     deliberately non-alphabetical config (["gcp", "aws", "azure"])
     where AWS fails (ambient credentials → factory mock returns error)
     and Azure/GCP succeed empty (no enabled accounts) — the resulting
     successfulProviders must be ["gcp", "azure"], distinguishable from
     any of: input order, alphabetical, or arbitrary map iteration.

  2. ctx cancellation propagates: a pre-cancelled ctx surfaces as
     context.Canceled (not a "successful but empty" CollectResult).

Expected wall-time impact: CollectRecommendations drops from
sum(per-provider) to max(per-provider). With AWS+Azure+GCP all enabled
and Azure typically the slowest at ~15-20s post-#258, total drops from
~30-45s to ~max(15-20s).

Labels

  • effort/m (Days)
  • impact/many (Affects most users)
  • priority/p2 (Backlog-worthy)
  • severity/medium (Moderate harm)
  • triaged (Item has been triaged)
  • type/bug (Defect)
  • urgency/this-quarter (Within the quarter)
