
feat(api): async recommendation refresh via scheduler Lambda invoke (closes #257) #260

Merged
cristim merged 3 commits into feat/multicloud-web-frontend from fix/issue-257-async-refresh on May 4, 2026

Conversation

cristim (Member) commented May 4, 2026

Summary

Async recommendation refresh end-to-end (closes #257). The 502 on user-triggered refresh and on first-load was caused by the API Lambda doing the full provider fan-out synchronously and exceeding its 60s timeout — Azure alone takes >60s (#258 is the orthogonal parallelization quick-win).

This PR splits the request path so the user-facing API returns immediately:

POST /api/recommendations/refresh:
  1. requirePermission(view, recommendations)
  2. read freshness for the response body
  3. atomically MarkCollectionStarted
       (returns false → 409 if another collection is in flight, with a
        5-min auto-recovery window in case the scheduler crashed mid-run)
  4. self-invoke the API Lambda via lambda:InvokeFunction with
     InvocationType=Event, hitting the EventBridge "scheduled" path
     (CUDly has one Lambda; the scheduler is just the cron-task code
     path inside it)
  5. return 202 with started_at + previous last_collected_at

The same async-invoke path is wired into the cold-start branch of GET /api/recommendations so first-load no longer blocks on a 60s provider fan-out.

Bookkeeping additions

StoreInterface (Postgres + 5 test mocks across analytics/api/purchase/scheduler/server packages):

MarkCollectionStarted(ctx) (bool, error)   // race-safe set; 5-min recovery
ClearCollectionStarted(ctx) error          // called by scheduler on exit

Migration 000047 adds last_collection_started_at on recommendations_state (default NULL).

Folded-in follow-ups (originally listed as out of scope)

After landing the backend, I folded in the deploy + UX pieces in a second commit so the PR is end-to-end functional and verifiable in deployed mode:

Terraform (terraform/modules/compute/aws/lambda/main.tf):

  • aws_iam_role_policy.async_self_invoke — grants lambda:InvokeFunction scoped to the API Lambda's own ARN
  • SCHEDULER_LAMBDA_ARN env var derived from aws_partition + aws_caller_identity + var.region + var.stack_name (avoids self-reference cycle on aws_lambda_function.main.arn)
  • Data sources aws_partition.current + aws_caller_identity.current

Frontend:

  • RefreshRecommendationsResult and RecommendationsFreshness types updated for the new payload shapes (legacy fields kept optional during transition so existing fixtures still type-check)
  • freshness.ts refresh-click handler now polls /recommendations/freshness every 5 s after the POST returns, waiting for last_collection_started_at to clear (or for last_collected_at to advance past started_at). 10-min safety cap.

Test plan

  • go vet ./... clean
  • go test ./internal/... — 3,614 tests pass
  • npm test (frontend) — 1,453 tests pass
  • tsc --noEmit clean
  • Pre-commit hooks (gocyclo, gosec, trivy, secrets scan, gofmt, govet, terraform fmt+validate+tflint, frontend build+tests) all green
  • CI green on push
  • Deployed verification: bump the RDS Proxy / Lambda timeout if Azure still takes > 60s (covered by #258, perf(azure): parallelize the 5 sequential service calls in GetRecommendations)

Closes #257.

feat(api): async recommendation refresh via scheduler Lambda invoke (closes #257)

The 502 on user-triggered refresh and on first-load (cold-start) was
caused by the API Lambda doing the full provider-fan-out
synchronously and exceeding its 60s timeout — Azure alone takes
>60s. The collection completes successfully in the scheduler Lambda
(longer timeout), but the API Lambda was the wrong place to do the
work.

This change splits the request path so the user-facing API returns
immediately:

  POST /api/recommendations/refresh:
   1. requirePermission(view, recommendations)
   2. read freshness for the response body
   3. atomically MarkCollectionStarted (returns false → 409 if another
      collection is already in flight, with a 5-minute auto-recovery
      window in case the scheduler crashed mid-run and never cleared)
   4. async-invoke the scheduler Lambda via lambda:InvokeFunction with
      InvocationType=Event (fire-and-forget) when SCHEDULER_LAMBDA_ARN
      is set; otherwise fall back to synchronous CollectRecommendations
      so local/HTTP-mode dev still works
   5. return 202 with started_at + previous last_collected_at so the
      frontend can render "Last refreshed N min ago — collecting…"

The same async-invoke path is wired into the cold-start branch of
GET /api/recommendations so first-load no longer blocks on a 60s
provider fan-out.

Bookkeeping additions in StoreInterface (Postgres + 5 test mocks
across analytics/api/purchase/scheduler/server packages):

  MarkCollectionStarted(ctx) (bool, error)   // race-safe set
  ClearCollectionStarted(ctx) error          // called by scheduler at end

Migration 000047 adds the last_collection_started_at column on
recommendations_state (default NULL).

Out of scope for this PR (filed as separate triaged follow-ups when
needed): the API Lambda's IAM grant for lambda:InvokeFunction on the
scheduler ARN (Terraform), and the frontend refresh button + polling
banner. The backend handler degrades to synchronous-collect when
SCHEDULER_LAMBDA_ARN is unset, so deployed verification is gated on
the IAM + env-var follow-up.
@cristim cristim added labels on May 4, 2026: priority/p2 (Backlog-worthy), severity/medium (Moderate harm), urgency/this-quarter (Within the quarter), impact/many (Affects most users), effort/m (Days), type/bug (Defect), triaged (Item has been triaged)
coderabbitai Bot commented May 4, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f2e20da7-9472-49bb-bcfa-161cb4d009e4

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This PR enables asynchronous recommendation collection to prevent API Lambda timeouts on cold-start. It adds collection-state tracking via a new last_collection_started_at timestamp, implements a Lambda self-invocation mechanism in a new POST handler, persists in-flight state to the database, and returns HTTP 202 to clients when async collection begins, allowing polling for completion.

Changes

Async Recommendation Collection State Machine

Changes by layer:

  • Data shape (internal/config/types.go): RecommendationsFreshness adds a LastCollectionStartedAt *time.Time field to track when async collection begins.
  • Database schema (internal/database/postgres/migrations/000047_*): migration adds a nullable last_collection_started_at TIMESTAMPTZ column to the recommendations_state table.
  • Store interface & implementation (internal/config/interfaces.go, internal/config/store_postgres_recommendations.go): StoreInterface adds MarkCollectionStarted(ctx) (bool, error) (atomic "won race" check) and ClearCollectionStarted(ctx) error. PostgreSQL implementation: MarkCollectionStarted sets the timestamp only if NULL or older than 5 minutes; ClearCollectionStarted nullifies it. GetRecommendationsFreshness and SetRecommendationsCollectionError now read/write last_collection_started_at.
  • Handler & Lambda invocation (internal/api/handler.go, internal/api/handler_recommendations_refresh.go): Handler gains a lambdaInvoker LambdaInvokerInterface field. The new postRefreshRecommendations handler enforces permission, reads freshness, atomically marks collection start (409 on conflict), then invokes the scheduler asynchronously via SCHEDULER_LAMBDA_ARN or falls back to sync CollectRecommendations. Returns HTTP 202 with a RefreshResponse containing started_at and last_collected_at. Supporting asyncInvokeSelf, getLambdaInvoker, and triggerColdStartCollect helpers provide the async-invoke abstraction and test injection.
  • Router wiring (internal/api/router.go): refreshRecommendationsHandler now dispatches to postRefreshRecommendations instead of directly calling scheduler.CollectRecommendations.
  • Scheduler cleanup (internal/scheduler/scheduler.go): CollectRecommendations now defers clearCollectionStartedBestEffort to guarantee last_collection_started_at is cleared on completion (success or failure), allowing frontend polling to reliably detect end-of-collection.
  • Test mocks & assertions (internal/api/mocks_test.go, internal/api/handler_test.go, internal/analytics/collector_test.go, internal/purchase/mocks_test.go, internal/scheduler/scheduler_test.go, internal/scheduler/scheduler_overrides_test.go, internal/scheduler/scheduler_suppressions_test.go, internal/server/test_helpers_test.go): all mock ConfigStore implementations extended with MarkCollectionStarted and ClearCollectionStarted methods (default: return true, nil and nil). TestHandler_HandleRequest_RefreshRecommendations updated to mock freshness reads and collection-start marking, and to accept 200 or 202 status.

Sequence Diagram

sequenceDiagram
    participant Client
    participant APIHandler as API Handler
    participant ConfigStore as Config Store
    participant LambdaInvoker as Lambda Invoker
    participant SchedulerLambda as Scheduler Lambda
    
    rect rgba(200, 150, 255, 0.5)
    Note over Client,SchedulerLambda: Async Recommendation Collection (POST /api/recommendations/refresh)
    Client->>APIHandler: POST /api/recommendations/refresh
    APIHandler->>ConfigStore: MarkCollectionStarted()
    alt Collection already in-flight
        ConfigStore-->>APIHandler: false (409 Conflict)
        APIHandler-->>Client: 409 Collection in progress
    else First to acquire lock
        ConfigStore-->>APIHandler: true, nil
        APIHandler->>ConfigStore: GetRecommendationsFreshness()
        ConfigStore-->>APIHandler: freshness
        alt SCHEDULER_LAMBDA_ARN configured
            APIHandler->>LambdaInvoker: asyncInvokeSelf(SCHEDULER_LAMBDA_ARN)
            LambdaInvoker->>SchedulerLambda: Invoke(InvocationTypeEvent, event="scheduled_recommendations")
            SchedulerLambda-->>LambdaInvoker: OK (async, no wait)
            alt Invoke succeeded
                APIHandler-->>Client: 202 {started_at, last_collected_at}
                SchedulerLambda->>ConfigStore: CollectRecommendations() + defer ClearCollectionStarted()
                ConfigStore-->>SchedulerLambda: done
            else Invoke failed
                APIHandler->>ConfigStore: ClearCollectionStarted() [rollback]
                ConfigStore-->>APIHandler: cleared
                APIHandler-->>Client: 500/502 error
            end
        else Fallback: no ARN, sync collect
            APIHandler->>ConfigStore: CollectRecommendations()
            ConfigStore-->>APIHandler: collected
            APIHandler-->>Client: 202 {started_at, last_collected_at}
        end
    end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested labels

priority/p1, severity/high, urgency/now, type/feat

Poem

A Lambda that timed out at sixty,
Now async, returning quite nifty—
Mark start, invoke free,
Poll 'til collection's decree,
Cold recommendations, no longer iffy! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: title accurately captures the main feature, async recommendation refresh via scheduler Lambda invocation.
  • Linked Issues check ✅ Passed: PR implements async collection via Lambda invoke and MarkCollectionStarted coordination, addressing #257's timeout and cold-start collection blocking issues.
  • Out of Scope Changes check ✅ Passed: all changes are scoped to the async refresh implementation (handler, config store methods, database migration, mock updates). No unrelated modifications present.
  • Description check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.


cristim (Member, Author) commented May 4, 2026

@coderabbitai full review

coderabbitai Bot commented May 4, 2026

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
internal/api/handler_test.go (1)

498-528: ⚡ Quick win

Make this refresh-path test deterministic instead of accepting both 200 and 202.

The broad status assertion can mask regressions. Pin the branch via env setup and assert one exact outcome.

💡 Suggested test hardening
 func TestHandler_HandleRequest_RefreshRecommendations(t *testing.T) {
 	ctx := context.Background()
+	t.Setenv("SCHEDULER_LAMBDA_ARN", "") // force sync fallback path for this test
 	mockScheduler := new(MockScheduler)
 	mockAuth := new(MockAuthService)
 	mockStore := new(MockConfigStore)
@@
-	// 202 in async (Lambda) mode; handler degrades to a synchronous collect
-	// when SCHEDULER_LAMBDA_ARN is unset (the test default), which still
-	// returns 200 because the response is fully populated by the time we reply.
-	assert.Contains(t, []int{200, 202}, resp.StatusCode)
+	assert.Equal(t, 200, resp.StatusCode)
+	mockScheduler.AssertCalled(t, "CollectRecommendations", mock.Anything)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/api/handler_test.go` around lines 498 - 528, The test currently
accepts either 200 or 202 which hides branch regressions; make it deterministic
by explicitly setting the SCHEDULER_LAMBDA_ARN environment before creating the
Handler so the code follows the desired path, then assert the single expected
status code. For example, to pin the synchronous path, ensure
SCHEDULER_LAMBDA_ARN is unset (or set to an empty string) prior to calling
Handler.HandleRequest and replace the assert.Contains on resp.StatusCode with an
assert.Equal for 200; to pin the asynchronous path set SCHEDULER_LAMBDA_ARN to a
non-empty value and assert 202. Reference Handler.HandleRequest and the
mockScheduler/mockStore setup when applying the env change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/api/handler_recommendations_refresh.go`:
- Around line 172-186: MarkCollectionStarted currently discards the boolean "ok"
which indicates if this caller won the race; capture its return (ok, err) from
h.config.MarkCollectionStarted(ctx) and if ok == false simply return
h.config.GetRecommendationsFreshness(ctx) without calling h.asyncInvokeSelf or
ClearCollectionStarted. Only if ok == true should you call
h.asyncInvokeSelf(ctx, schedulerARN); if that invoke fails then call
h.config.ClearCollectionStarted(ctx) to roll back and return the error. This
ensures you don't redundantly invoke asyncInvokeSelf or accidentally clear
another caller's in-flight marker; refer to MarkCollectionStarted,
asyncInvokeSelf, ClearCollectionStarted, and GetRecommendationsFreshness to
locate the changes.
- Around line 102-114: The async branch currently uses a stale freshness
snapshot read before calling MarkCollectionStarted, so LastCollectionStartedAt
is nil and startedAt falls back to now; after a successful async invoke you
should re-fetch freshness (same way the sync path does) and use
freshness.LastCollectionStartedAt if non-nil to set StartedAt in the
RefreshResponse (update the code around the async invoke to call the freshness
read function again and mirror the sync logic that assigns startedAt, preserving
use of MarkCollectionStarted and returning the refreshed LastCollectedAt and
StartedAt values).

In `@internal/api/router.go`:
- Around line 341-342: The route for /api/recommendations/refresh is still
protected by AuthAdmin at registration, so non-admins are blocked before
reaching Router.refreshRecommendationsHandler which calls
h.postRefreshRecommendations (and performs requirePermission(view,
recommendations)); update the route registration to use a non-admin auth policy
(e.g., AuthUser or the equivalent) or remove the Admin-only guard so the
handler's permission check can run, ensuring the registration change references
the same route path and uses the AuthUser/AuthNone constant instead of
AuthAdmin.

In `@internal/scheduler/scheduler.go`:
- Around line 153-157: Deferred cleanup currently captures and reuses the
incoming request context (ctx) in the defer call to
s.clearCollectionStartedBestEffort, which can be canceled before defer runs and
prevents clearing last_collection_started_at; change the defer to capture a
non-cancelable context (e.g. context.Background() or a created context with its
own timeout) and pass that into s.clearCollectionStartedBestEffort so the
cleanup will run even if the original ctx is canceled — update the defer
statement where s.clearCollectionStartedBestEffort is registered (and any
similar defer sites around that block) to use the new background/independent
context instead of ctx.
- Around line 143-149: Current unconditional clearing of
last_collection_started_at can erase a concurrent run’s marker; change to an
ownership-aware clear by having MarkCollectionStarted return an owner identifier
(e.g., a UUID token or the started_at timestamp) and then perform a
compare-and-clear when exiting: only clear last_collection_started_at if the
stored value matches the returned owner token/started_at. Update the scheduler
exit/cleanup paths that currently clear last_collection_started_at (the
unconditional clear code around the exit logic referenced plus the similar spots
at the other mentioned locations) to use this compare-and-clear check so
overlapping cron and async runs don’t wipe each other’s markers.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bab35008-6b17-4228-bbf9-3d56361a8b13

📥 Commits

Reviewing files that changed from the base of the PR and between c8c3ba9 and e0f647a.

📒 Files selected for processing (17)
  • internal/analytics/collector_test.go
  • internal/api/handler.go
  • internal/api/handler_recommendations_refresh.go
  • internal/api/handler_test.go
  • internal/api/mocks_test.go
  • internal/api/router.go
  • internal/config/interfaces.go
  • internal/config/store_postgres_recommendations.go
  • internal/config/types.go
  • internal/database/postgres/migrations/000047_recommendations_state_started_at.down.sql
  • internal/database/postgres/migrations/000047_recommendations_state_started_at.up.sql
  • internal/purchase/mocks_test.go
  • internal/scheduler/scheduler.go
  • internal/scheduler/scheduler_overrides_test.go
  • internal/scheduler/scheduler_suppressions_test.go
  • internal/scheduler/scheduler_test.go
  • internal/server/test_helpers_test.go

Comment on lines +102 to +114
// Re-read started_at (set by MarkCollectionStarted) for the 202 body.
// In the synchronous path this will be NULL (cleared by the defer), so we
// fall back to now as a reasonable sentinel.
now := time.Now().UTC()
startedAt := now
if freshness.LastCollectionStartedAt != nil {
	startedAt = *freshness.LastCollectionStartedAt
}

return &RefreshResponse{
	StartedAt:       startedAt,
	LastCollectedAt: freshness.LastCollectedAt,
}, nil

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Stale freshness in async path yields incorrect started_at.

In the async branch, freshness was read at line 62 before MarkCollectionStarted was called. Consequently, freshness.LastCollectionStartedAt will be nil (or stale), and startedAt always falls back to the local now sentinel rather than the actual database timestamp.

The sync path correctly re-reads freshness (lines 96-99) after collection. The async path should do the same after a successful invoke.

Proposed fix: re-read freshness after async invoke
 	} else {
 		// Non-Lambda (HTTP) mode: collect synchronously. ClearCollectionStarted
 		// is called inside CollectRecommendations via the defer, so we don't
 		// need to call it here.
 		if _, collectErr := h.scheduler.CollectRecommendations(ctx); collectErr != nil {
 			return nil, fmt.Errorf("collection failed: %w", collectErr)
 		}
+	}
+
+	// Re-read freshness to capture the actual started_at / collected_at values.
+	freshness, err = h.config.GetRecommendationsFreshness(ctx)
+	if err != nil {
+		return nil, fmt.Errorf("failed to read freshness: %w", err)
+	}
-		// Re-read freshness after synchronous collect so the response reflects
-		// the actual collection time.
-		freshness, err = h.config.GetRecommendationsFreshness(ctx)
-		if err != nil {
-			return nil, fmt.Errorf("failed to read freshness after collect: %w", err)
-		}
-	}

 	// Re-read started_at (set by MarkCollectionStarted) for the 202 body.
-	// In the synchronous path this will be NULL (cleared by the defer), so we
-	// fall back to now as a reasonable sentinel.
+	// In the synchronous path this may be NULL (cleared by defer), so fall back
+	// to now as a reasonable sentinel.
 	now := time.Now().UTC()

Comment on lines +172 to +186
if schedulerARN != "" {
	// Atomic mark. If another request already marked it, that's fine —
	// we just return the current freshness.
	_, err := h.config.MarkCollectionStarted(ctx)
	if err != nil {
		return nil, fmt.Errorf("failed to mark cold-start collection: %w", err)
	}
	if invokeErr := h.asyncInvokeSelf(ctx, schedulerARN); invokeErr != nil {
		// Best-effort: undo the mark so the next request can try again.
		_ = h.config.ClearCollectionStarted(ctx)
		return nil, fmt.Errorf("failed to trigger cold-start collect: %w", invokeErr)
	}
	// Re-read freshness to return the started_at value.
	return h.config.GetRecommendationsFreshness(ctx)
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Discarded ok from MarkCollectionStarted causes redundant invokes and unsafe rollback.

The boolean return indicating whether this caller won the race is discarded. When ok == false (another collection is already in-flight):

  1. asyncInvokeSelf is called anyway — wasteful/duplicate Lambda invocation.
  2. If the invoke fails, ClearCollectionStarted clears the other caller's in-flight marker, potentially allowing premature re-triggering.

The comment at line 173-174 acknowledges the scenario but the code doesn't act on it.

Proposed fix: respect the race result
 	if schedulerARN != "" {
 		// Atomic mark. If another request already marked it, that's fine —
 		// we just return the current freshness.
-		_, err := h.config.MarkCollectionStarted(ctx)
+		ok, err := h.config.MarkCollectionStarted(ctx)
 		if err != nil {
 			return nil, fmt.Errorf("failed to mark cold-start collection: %w", err)
 		}
-		if invokeErr := h.asyncInvokeSelf(ctx, schedulerARN); invokeErr != nil {
-			// Best-effort: undo the mark so the next request can try again.
-			_ = h.config.ClearCollectionStarted(ctx)
-			return nil, fmt.Errorf("failed to trigger cold-start collect: %w", invokeErr)
+		if ok {
+			if invokeErr := h.asyncInvokeSelf(ctx, schedulerARN); invokeErr != nil {
+				// Best-effort: undo the mark so the next request can try again.
+				_ = h.config.ClearCollectionStarted(ctx)
+				return nil, fmt.Errorf("failed to trigger cold-start collect: %w", invokeErr)
+			}
 		}
 		// Re-read freshness to return the started_at value.
 		return h.config.GetRecommendationsFreshness(ctx)
 	}

Comment thread internal/api/router.go
Comment on lines 341 to +342
 func (r *Router) refreshRecommendationsHandler(ctx context.Context, req *events.LambdaFunctionURLRequest, params map[string]string) (any, error) {
-	return r.h.scheduler.CollectRecommendations(ctx)
+	return r.h.postRefreshRecommendations(ctx, req)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Refresh handler is permission-aware, but route-level auth still blocks non-admin users.

At Line 342, refresh now routes through postRefreshRecommendations, which performs requirePermission(view, recommendations). However, /api/recommendations/refresh is still registered with default AuthAdmin (Line 92), so non-admin users never reach this handler.

💡 Proposed fix
- {ExactPath: "/api/recommendations/refresh", Method: "POST", Handler: r.refreshRecommendationsHandler},
+ {ExactPath: "/api/recommendations/refresh", Method: "POST", Handler: r.refreshRecommendationsHandler, Auth: AuthUser},

Comment on lines +143 to +149
// Bookkeeping: always clears last_collection_started_at on exit (success or
// failure) so the frontend polling loop can detect completion. The scheduler
// is invoked either by the cron EventBridge rule or by an async self-invoke
// from the POST /api/recommendations/refresh handler. In the async case,
// MarkCollectionStarted has already set last_collection_started_at; the
// cron case leaves it NULL (no async-invoke bookkeeping for cron runs, which
// are expected and not user-triggered).

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Unconditional clear can erase another in-flight run’s marker.

Because every scheduler run clears last_collection_started_at on exit, an overlapping cron invocation can clear the marker for a user-triggered async run before that run completes. That weakens the 409 guard and can end frontend polling early.

A robust fix is to move from plain timestamp clearing to ownership-aware clear (e.g., invocation token or compare-and-clear on expected started_at value).

Also applies to: 153-157, 254-257


Comment on lines +153 to +157
// Always clear last_collection_started_at on exit so the frontend knows
// the collection has finished. Extracted into a helper to keep this
// function under the cyclomatic-complexity gate.
defer s.clearCollectionStartedBestEffort(ctx)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Deferred marker clear should not reuse the request context.

At Line 156, the deferred cleanup uses ctx. If collection ends due to a deadline or cancellation, Line 255 executes with a canceled context and fails immediately, leaving last_collection_started_at stuck until TTL recovery.

💡 Proposed fix
-	defer s.clearCollectionStartedBestEffort(ctx)
+	defer s.clearCollectionStartedBestEffort()

-func (s *Scheduler) clearCollectionStartedBestEffort(ctx context.Context) {
-	if err := s.config.ClearCollectionStarted(ctx); err != nil {
+func (s *Scheduler) clearCollectionStartedBestEffort() {
+	clearCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+	defer cancel()
+	if err := s.config.ClearCollectionStarted(clearCtx); err != nil {
 		logging.Errorf("failed to clear collection started: %v", err)
 	}
 }

Also applies to: 254-257

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduler/scheduler.go` around lines 153 - 157: deferred cleanup
currently captures and reuses the incoming request context (ctx) in the defer
call to s.clearCollectionStartedBestEffort, which can be canceled before defer
runs and prevents clearing last_collection_started_at; change the defer to
capture a non-cancelable context (e.g. context.Background() or a created context
with its own timeout) and pass that into s.clearCollectionStartedBestEffort so
the cleanup will run even if the original ctx is canceled — update the defer
statement where s.clearCollectionStartedBestEffort is registered (and any
similar defer sites around that block) to use the new background/independent
context instead of ctx.

Folds the previously-out-of-scope follow-ups into PR #257 itself so
deployed verification doesn't require a second PR:

  Terraform (compute/aws/lambda):
   - new aws_iam_role_policy.async_self_invoke grants
     lambda:InvokeFunction scoped to the API Lambda's own ARN, so the
     async-refresh handler can fire-and-forget self-invoke the
     scheduler-task code path with InvocationType=Event
   - new SCHEDULER_LAMBDA_ARN env var derived from
     aws_partition + aws_caller_identity + var.region + var.stack_name
     so the variable is the API Lambda's own ARN without a self-
     reference cycle on aws_lambda_function.main.arn
   - data sources aws_partition.current and aws_caller_identity.current
     added at the top of the module

  Frontend (api/recommendations.ts + freshness.ts):
   - RefreshRecommendationsResult: started_at + last_collected_at
     (matches the new RefreshResponse from the backend handler).
     Pre-#257 fields (recommendations / total_savings / providers)
     kept as optional during the transition so legacy fixtures still
     type-check.
   - RecommendationsFreshness: new optional last_collection_started_at
     so polling can detect when the in-flight async collection has
     finished (the backend clears it on both success and failure paths).
   - freshness.ts refresh-button click handler now polls
     /recommendations/freshness every 5s after the POST returns,
     waiting for last_collection_started_at to clear (or for
     last_collected_at to advance past started_at). 10-min safety cap
     so a stuck collection never wedges the UI. Optional started_at
     in the response degrades gracefully to "synchronous, exit
     immediately" so pre-#257 mocks still pass.

Tests: 1,453 frontend tests pass; existing freshness test suite
unmodified and green.
@cristim
Member Author

cristim commented May 4, 2026

@coderabbitai full review

I folded the originally-out-of-scope follow-ups into this PR:

  • Terraform IAM grant + env var
  • Frontend type updates + 5s polling loop after the POST returns

3,614 Go tests + 1,453 frontend tests + all pre-commit hooks (terraform fmt/validate/tflint, frontend build, etc.) green locally.

@coderabbitai

coderabbitai Bot commented May 4, 2026

@cristim Thanks for the update! I'll kick off a fresh full review covering all the folded-in additions:

  • Terraform: IAM grant for API Lambda → scheduler Lambda invoke + SCHEDULER_LAMBDA_ARN env var wiring
  • Frontend: type updates for the new RefreshResponse shape + the 5-second polling loop after the POST /api/recommendations/refresh returns

Solid test coverage numbers — will review all the changes thoroughly.

[full_review]

…sh (#257)

Five actionable + one nitpick from CR review of e0f647a..170b18e:

  1. handler_recommendations_refresh.go::triggerColdStartCollect now
     respects the bool from MarkCollectionStarted: ok=false means
     another caller already won the race; we return current freshness
     without a second async-invoke and without ClearCollectionStarted
     (which would wipe the other caller's marker).

  2. postRefreshRecommendations re-reads freshness AFTER the async
     invoke so LastCollectionStartedAt reflects the value
     MarkCollectionStarted just wrote. Previously the freshness
     snapshot was taken before the mark, so StartedAt always fell
     back to time.Now() instead of the persisted value. The collect-
     trigger logic was extracted into runMarkedCollection so the
     parent stays under the cyclomatic-complexity gate.

  3. router.go: /api/recommendations/refresh now uses Auth: AuthUser
     (was the AuthAdmin default). The handler still enforces
     requirePermission(view, recommendations) so view-only roles can
     refresh data they're allowed to see; admin-only would have
     blocked them at the router layer.

  4. scheduler.go: defer-clear of last_collection_started_at now uses
     a fresh context.Background() with a 5s timeout. The request ctx
     could already be canceled by the time the defer runs (slow
     collect outliving caller deadline), which would leave the marker
     in place until the 5-min auto-recovery window kicked in.

  5. handler_test.go: TestHandler_HandleRequest_RefreshRecommendations
     now pins the synchronous-fallback branch via t.Setenv() and
     asserts a single 200 (was assert.Contains([200, 202])). Async-
     branch coverage moves to handler-level integration tests that
     inject lambdaInvoker directly.

CR finding #6 (compare-and-clear with owner token to avoid wiping a
concurrent run's marker) deliberately deferred — the existing 5-min
auto-recovery window already handles overlapping runs and the
owner-token refactor is a larger architectural change. Filed as
follow-up in the PR thread.
@cristim
Member Author

cristim commented May 4, 2026

CodeRabbit pass-1 response

Fixed (pushed in 49bc90d):

  1. triggerColdStartCollect now respects MarkCollectionStarted's ok bool (handler_recommendations_refresh.go:172-186). When another caller already won the race, return current freshness without a second async-invoke and without ClearCollectionStarted.
  2. Async branch re-reads freshness AFTER MarkCollectionStarted (handler_recommendations_refresh.go:102-114) so LastCollectionStartedAt reflects the persisted value. Trigger logic extracted into runMarkedCollection to keep the parent under the cyclomatic-complexity gate.
  3. /api/recommendations/refresh now uses Auth: AuthUser (router.go:341-342); the handler still enforces requirePermission(view, recommendations) so view-only roles can refresh data they're allowed to see.
  4. Defer-clear uses fresh context.Background() with 5s timeout (scheduler.go:153-157) so the marker is cleared even if the request ctx was canceled mid-collect.
  5. TestHandler_HandleRequest_RefreshRecommendations pins the sync-fallback branch via t.Setenv() and asserts a single 200 (was assert.Contains([200, 202])).

Dismissed with justification:

  • scheduler.go:143-149 compare-and-clear with owner token (priority/major): the existing 5-min auto-recovery window already handles overlapping cron + async runs. Owner-token bookkeeping (UUID per run, compare-and-clear on exit) is a larger architectural change — adding an owner parameter to MarkCollectionStarted/ClearCollectionStarted, a column migration, and matching mock updates across 5 packages. The narrow fix-now value is small (the practical race window is <60s of overlap given cron cadence), so deferring to a follow-up issue rather than expanding this PR's scope.

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented May 4, 2026

@cristim Thanks for the detailed pass-1 response and the follow-through on the fixes. Here's a quick acknowledgement before the next pass:

Accepted fixes (49bc90d):

  1. ✅ triggerColdStartCollect — correctly short-circuits when ok == false from MarkCollectionStarted, avoiding a spurious second async-invoke and a mismatched ClearCollectionStarted.
  2. ✅ Freshness re-read post-MarkCollectionStarted: LastCollectionStartedAt in the 202 response now reflects the persisted DB timestamp. The runMarkedCollection extraction also helps with cyclomatic complexity.
  3. ✅ Router auth level (AuthUser) — appropriate since requirePermission is the real enforcement gate.
  4. ✅ Defer-clear with context.Background() + 5s timeout — correct defensive pattern; avoids dropped cleanup on a cancelled request context.
  5. ✅ Test pinning to sync-fallback path — deterministic assertion on 200 is cleaner than the Contains([200, 202]) hedge.

Dismissed (owner token / compare-and-clear): The 5-minute auto-recovery window + the narrow practical overlap window (<60 s given cron cadence) is a reasonable justification for deferring the UUID-based owner bookkeeping to a follow-up. If you'd like me to open a tracking issue for that architectural follow-up, just say the word.

Kicking off a fresh full review now.

[full_review]

@cristim cristim merged commit 7eb77d1 into feat/multicloud-web-frontend May 4, 2026
5 checks passed
cristim added a commit that referenced this pull request May 4, 2026
…up) (#265)

The async refresh path landed in PR #260 but the Lambda self-invoke
payload didn't match what the receiving Lambda expects. The shipped
payload was:

  {"event": "scheduled_recommendations"}

But internal/server/handler.go::ScheduledEvent has no "event" field —
it has Action, Source, DetailType, Detail. Unmarshalling that JSON
gives Action="", and ParseScheduledEvent rejects every event with:

  unknown scheduled task action: ""

Verified live in /aws/lambda/cudly-dev-426fc8af-api request
3befb380-6d46-47e1-954b-69a2b59ea90a:

  16:49:40 Received unknown event (size: 37 bytes)
  16:49:40 Unknown event type, treating as scheduled event
  16:49:40 {"errorMessage":"failed to parse scheduled event:
              unknown scheduled task action: \"\""}

The 37-byte event = the {"event":"scheduled_recommendations"} string
exactly. The async self-invoke is firing successfully but every
invocation is rejected, so the "Refreshing..." banner never clears
and collection never runs — same user-visible symptom (refresh times
out + no fresh data) as if the env var or IAM grant was missing.

Fix: send the payload shape the dispatcher actually accepts:

  {"source": "aws.events", "action": "collect_recommendations"}

action="collect_recommendations" maps to TaskCollectRecommendations
in the ParseScheduledEvent switch. source="aws.events" makes
detectLambdaEventType in internal/server/lambda.go classify the
invoke consistently with EventBridge cron deliveries that already
exercise this code path.
cristim added a commit that referenced this pull request May 4, 2026
 #267 #268) (#269)

* perf(aws): parallelize 5 sequential service calls in GetAllRecommendations (closes #266)

The dispatcher in providers/aws/recommendations/client.go::GetAllRecommendations
ran 5 service calls (EC2 / RDS / ElastiCache / OpenSearch / Redshift) one
after another with an artificial 100ms stagger between each. Each per-
service call hits AWS Cost Explorer's GetReservationPurchaseRecommendation,
which routinely takes multiple seconds. Total wall time per account scaled
as sum(per-service) instead of max(per-service).

Mirrors the Azure parallelisation in providers/azure/recommendations.go
(closes #258, commit b10326c). The five calls have no inter-service
dependency, so they can run concurrently under errgroup.WithContext. Each
goroutine captures its own error in a closure-scoped variable and returns
nil to the group so one service's failure does not cancel siblings
(matching the previous loop's `continue`-on-error tolerance).

Results are merged in the canonical order
EC2 → RDS → ElastiCache → OpenSearch → Redshift after Wait() so any
order-sensitive callers stay stable. After Wait(), ctx.Err() is checked
and propagated so a canceled parent ctx surfaces as context.Canceled /
context.DeadlineExceeded rather than being swallowed by the per-service
error-isolation goroutines (which all return nil).

The 100ms artificial stagger is removed — it was a sequential-mode rate-
limit hack that adds 400ms per account for no benefit under concurrent
fan-out (Cost Explorer rate limits apply per request, not per consecutive
request).

Behaviour change vs the previous sequential loop: per-service errors are
now logged at WARN via the new mergeServiceResults helper. The previous
loop swallowed them silently with a bare `continue`, leaving operators no
signal when a single service was misbehaving. The serviceResult struct +
mergeServiceResults helper extraction also keeps GetAllRecommendations
under the project's gocyclo gate (.golangci.yml min-complexity: 15).

New test TestGetAllRecommendations_PropagatesContextCancellation pins the
contract: calling GetAllRecommendations with a pre-cancelled context must
return context.Canceled. Mirrors the Azure equivalent at
providers/azure/recommendations_test.go::TestRecommendationsClientAdapter_GetRecommendations_PropagatesContextCancellation.

Expected end-to-end after this change: ~max(individual call durations)
instead of sum + 4×100ms stagger.

* perf(gcp): parallelize region × service nested loops in GetRecommendations (closes #267)

The dispatcher in providers/gcp/recommendations.go::GetRecommendations was
doubly serial — `for region { for service { ... } }` — fetching Compute
Engine then Cloud SQL recommendations one region at a time. With ~30+
GCP regions × 2 services, that's 60+ sequential RPCs per account against
the project-scoped Recommender API. Even at 200ms per call this is a
12s+ floor; in practice most calls are slower. After #260 (async self-
invoke) this no longer triggers user-facing 502s, but it directly
inflates the scheduler Lambda's runtime and the in-flight freshness-
banner window.

Two-level concurrent fan-out, mirroring the Azure pattern in
providers/azure/recommendations.go (closes #258, commit b10326c) and
the AWS service-loop parallelisation (closes #266):

  - Outer: errgroup over regions, capped at gcpRegionConcurrency()
    (CUDLY_GCP_REGION_PARALLELISM, default 10) to stay polite to the
    project-scoped Recommender API quota. Lower than the existing
    account-level CUDLY_MAX_ACCOUNT_PARALLELISM (default 20) because
    each account already gets its own goroutine slot from
    fanOutPerAccount; cumulative concurrency would multiply otherwise.
  - Inner: within each region's goroutine, Compute + Cloud SQL run as
    two further goroutines under a per-region sub-errgroup, so per-
    region cost is max(compute, sql) rather than compute + sql.

Each goroutine returns nil to its errgroup so a single per-(region,
service) failure does not cancel siblings — preserves the previous
silent-skip-on-err shape semantically.

Behaviour change vs the previous nested for-loops: per-(region,
service) errors that were silently swallowed by the previous
`if err == nil { ... }` shape are now logged at WARN with the
region+service tag so misconfigured projects are diagnosable. Errors
still don't propagate to callers (preserving silent-skip), only the
log severity changes from "invisible" to WARN.

After the outer Wait(), ctx.Err() is checked and propagated so a
canceled parent ctx surfaces as context.Canceled /
context.DeadlineExceeded rather than being swallowed by the per-region
error-isolation goroutines (which all return nil).

Result merging walks regions in sorted order and appends compute then
sql per region — output is deterministic independent of GCP API
region-list ordering or goroutine completion order.

The collectRegion helper extraction also keeps GetRecommendations
under the project's gocyclo gate (.golangci.yml min-complexity: 15)
after the post-Wait ctx.Err() block was added.

Tests added:
  - TestRecommendationsClientAdapter_GetRecommendations_PropagatesContextCancellation
    pins the contract that a pre-cancelled ctx surfaces as context.Canceled.
  - TestGCPRegionConcurrency pins the env-knob parser semantics
    (unset/positive/non-numeric/zero/negative/explicit-unset).

Expected end-to-end after this change:
ceil(N_regions / cap) × max(compute, sql) instead of
N_regions × (compute + sql) — for a project with 30 regions and a cap
of 10, that's a ~6× speedup before counting the per-region service-
level parallelism win.

* perf(scheduler): parallelize provider collection loop AWS/Azure/GCP (closes #268)

The provider loop in scheduler.go::CollectRecommendations ran AWS, Azure
and GCP collection sequentially: total wall time = sum(per-provider)
rather than max(per-provider). After #260 (async self-invoke) the user-
facing 502 on POST /api/recommendations/refresh is gone, but the
scheduler Lambda itself still pays the serial cost — visible as a
longer "collection in flight" freshness-banner window and reduced
cron-tick headroom.

Provider-level fan-out under errgroup, mirroring the Azure and AWS
service-loop parallelisations (#258 / #266). Each per-provider goroutine
returns nil to the group so a single provider's error does not cancel
its siblings — preserves the previous loop's `continue`-on-error
semantics. Per-provider results are written into a map under a single
mutex (the same pattern as fanOutPerAccount at scheduler.go:393).

After Wait, ctx.Err() is checked and propagated so a canceled parent
ctx surfaces as context.Canceled / context.DeadlineExceeded rather than
being swallowed by the per-provider error-isolation goroutines (which
all return nil).

Deterministic merge: walk EnabledProviders in config order to assemble
successfulProviders / successfulCollects / allRecommendations /
totalSavings / failedProviders. Order is by config, NOT by goroutine
completion order or by Go's randomised map iteration — keeps existing
tests stable and the freshness-banner / notification-email content
predictable.

The new collectAllProviders helper extraction also keeps
CollectRecommendations under the project's gocyclo gate
(.golangci.yml min-complexity: 15) after the errgroup + post-Wait
ctx.Err() block was added.

No concurrency cap on the outer fan-out — the universe is at most 3
providers; the per-account fan-out inside each provider is still
bounded by CUDLY_MAX_ACCOUNT_PARALLELISM (default 20) and per-region
GCP fan-out by CUDLY_GCP_REGION_PARALLELISM (default 10, from #267).

Test mocks updated: per-provider mock methods (CreateAndValidateProvider,
ListCloudAccounts, GetRecommendationsClient, GetAllRecommendations) are
now invoked from inside the errgroup goroutine, so they receive the
errgroup-derived gctx rather than the caller's ctx. Mock setups changed
from literal ctx to mock.Anything where the call sits inside the
goroutine path; calls made before/after the fan-out (GetGlobalConfig,
SendNewRecommendationsNotification, persistCollection writes) still
pin literal ctx because they run on the caller's context unchanged.

New TestScheduler_CollectRecommendations_ParallelProviders pins three
contracts:

  1. successfulProviders ordering matches EnabledProviders config order,
     not goroutine completion order or alphabetical. Asserted with a
     deliberately non-alphabetical config (["gcp", "aws", "azure"])
     where AWS fails (ambient credentials → factory mock returns error)
     and Azure/GCP succeed empty (no enabled accounts) — the resulting
     successfulProviders must be ["gcp", "azure"], distinguishable from
     any of: input order, alphabetical, or arbitrary map iteration.

  2. ctx cancellation propagates: a pre-cancelled ctx surfaces as
     context.Canceled (not a "successful but empty" CollectResult).

Expected wall-time impact: CollectRecommendations drops from
sum(per-provider) to max(per-provider). With AWS+Azure+GCP all enabled
and Azure typically the slowest at ~15-20s post-#258, total drops from
~30-45s to ~max(15-20s).

Labels

  • effort/m (Days)
  • impact/many (Affects most users)
  • priority/p2 (Backlog-worthy)
  • severity/medium (Moderate harm)
  • triaged (Item has been triaged)
  • type/bug (Defect)
  • urgency/this-quarter (Within the quarter)
