feat(api): async recommendation refresh via scheduler Lambda invoke (closes #257) #260
Conversation
…loses #257)

The 502 on user-triggered refresh and on first-load (cold-start) was caused by the API Lambda doing the full provider fan-out synchronously and exceeding its 60s timeout — Azure alone takes >60s. The collection completes successfully in the scheduler Lambda (longer timeout), but the API Lambda was the wrong place to do the work.

This change splits the request path so the user-facing API returns immediately. POST /api/recommendations/refresh:

1. requirePermission(view, recommendations)
2. read freshness for the response body
3. atomically MarkCollectionStarted (returns false → 409 if another collection is already in flight, with a 5-minute auto-recovery window in case the scheduler crashed mid-run and never cleared)
4. async-invoke the scheduler Lambda via lambda:InvokeFunction with InvocationType=Event (fire-and-forget) when SCHEDULER_LAMBDA_ARN is set; otherwise fall back to synchronous CollectRecommendations so local/HTTP-mode dev still works
5. return 202 with started_at + previous last_collected_at so the frontend can render "Last refreshed N min ago — collecting…"

The same async-invoke path is wired into the cold-start branch of GET /api/recommendations so first-load no longer blocks on a 60s provider fan-out.

Bookkeeping additions in StoreInterface (Postgres + 5 test mocks across analytics/api/purchase/scheduler/server packages):

    MarkCollectionStarted(ctx) (bool, error)  // race-safe set
    ClearCollectionStarted(ctx) error         // called by scheduler at end

Migration 000047 adds the last_collection_started_at column on recommendations_state (default NULL).

Out of scope for this PR (filed as separate triaged follow-ups when needed): the API Lambda's IAM grant for lambda:InvokeFunction on the scheduler ARN (Terraform), and the frontend refresh button + polling banner. The backend handler degrades to synchronous-collect when SCHEDULER_LAMBDA_ARN is unset, so deployed verification is gated on the IAM + env-var follow-up.
📝 Walkthrough

This PR enables asynchronous recommendation collection to prevent API Lambda timeouts on cold-start. It adds collection-state tracking via a new …

Changes: Async Recommendation Collection State Machine
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant APIHandler as API Handler
    participant ConfigStore as Config Store
    participant LambdaInvoker as Lambda Invoker
    participant SchedulerLambda as Scheduler Lambda
    rect rgba(200, 150, 255, 0.5)
        Note over Client,SchedulerLambda: Async Recommendation Collection (POST /api/recommendations/refresh)
        Client->>APIHandler: POST /api/recommendations/refresh
        APIHandler->>ConfigStore: MarkCollectionStarted()
        alt Collection already in-flight
            ConfigStore-->>APIHandler: false (409 Conflict)
            APIHandler-->>Client: 409 Collection in progress
        else First to acquire lock
            ConfigStore-->>APIHandler: true, nil
            APIHandler->>ConfigStore: GetRecommendationsFreshness()
            ConfigStore-->>APIHandler: freshness
            alt SCHEDULER_LAMBDA_ARN configured
                APIHandler->>LambdaInvoker: asyncInvokeSelf(SCHEDULER_LAMBDA_ARN)
                LambdaInvoker->>SchedulerLambda: Invoke(InvocationTypeEvent, event="scheduled_recommendations")
                SchedulerLambda-->>LambdaInvoker: OK (async, no wait)
                alt Invoke succeeded
                    APIHandler-->>Client: 202 {started_at, last_collected_at}
                    SchedulerLambda->>ConfigStore: CollectRecommendations() + defer ClearCollectionStarted()
                    ConfigStore-->>SchedulerLambda: done
                else Invoke failed
                    APIHandler->>ConfigStore: ClearCollectionStarted() [rollback]
                    ConfigStore-->>APIHandler: cleared
                    APIHandler-->>Client: 500/502 error
                end
            else Fallback: no ARN, sync collect
                APIHandler->>ConfigStore: CollectRecommendations()
                ConfigStore-->>APIHandler: collected
                APIHandler-->>Client: 202 {started_at, last_collected_at}
            end
        end
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 5
🧹 Nitpick comments (1)
internal/api/handler_test.go (1)
498-528: ⚡ Quick win: Make this refresh-path test deterministic instead of accepting both 200 and 202.
The broad status assertion can mask regressions. Pin the branch via env setup and assert one exact outcome.
💡 Suggested test hardening
```diff
 func TestHandler_HandleRequest_RefreshRecommendations(t *testing.T) {
 	ctx := context.Background()
+	t.Setenv("SCHEDULER_LAMBDA_ARN", "") // force sync fallback path for this test
 	mockScheduler := new(MockScheduler)
 	mockAuth := new(MockAuthService)
 	mockStore := new(MockConfigStore)
@@
-	// 202 in async (Lambda) mode; handler degrades to a synchronous collect
-	// when SCHEDULER_LAMBDA_ARN is unset (the test default), which still
-	// returns 200 because the response is fully populated by the time we reply.
-	assert.Contains(t, []int{200, 202}, resp.StatusCode)
+	assert.Equal(t, 200, resp.StatusCode)
+	mockScheduler.AssertCalled(t, "CollectRecommendations", mock.Anything)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/api/handler_test.go` around lines 498 - 528, The test currently accepts either 200 or 202 which hides branch regressions; make it deterministic by explicitly setting the SCHEDULER_LAMBDA_ARN environment before creating the Handler so the code follows the desired path, then assert the single expected status code. For example, to pin the synchronous path, ensure SCHEDULER_LAMBDA_ARN is unset (or set to an empty string) prior to calling Handler.HandleRequest and replace the assert.Contains on resp.StatusCode with an assert.Equal for 200; to pin the asynchronous path set SCHEDULER_LAMBDA_ARN to a non-empty value and assert 202. Reference Handler.HandleRequest and the mockScheduler/mockStore setup when applying the env change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bab35008-6b17-4228-bbf9-3d56361a8b13
📒 Files selected for processing (17)
- internal/analytics/collector_test.go
- internal/api/handler.go
- internal/api/handler_recommendations_refresh.go
- internal/api/handler_test.go
- internal/api/mocks_test.go
- internal/api/router.go
- internal/config/interfaces.go
- internal/config/store_postgres_recommendations.go
- internal/config/types.go
- internal/database/postgres/migrations/000047_recommendations_state_started_at.down.sql
- internal/database/postgres/migrations/000047_recommendations_state_started_at.up.sql
- internal/purchase/mocks_test.go
- internal/scheduler/scheduler.go
- internal/scheduler/scheduler_overrides_test.go
- internal/scheduler/scheduler_suppressions_test.go
- internal/scheduler/scheduler_test.go
- internal/server/test_helpers_test.go
```go
// Re-read started_at (set by MarkCollectionStarted) for the 202 body.
// In the synchronous path this will be NULL (cleared by the defer), so we
// fall back to now as a reasonable sentinel.
now := time.Now().UTC()
startedAt := now
if freshness.LastCollectionStartedAt != nil {
	startedAt = *freshness.LastCollectionStartedAt
}

return &RefreshResponse{
	StartedAt:       startedAt,
	LastCollectedAt: freshness.LastCollectedAt,
}, nil
```
Stale freshness in async path yields incorrect started_at.
In the async branch, freshness was read at line 62 before MarkCollectionStarted was called. Consequently, freshness.LastCollectionStartedAt will be nil (or stale), and startedAt always falls back to the local now sentinel rather than the actual database timestamp.
The sync path correctly re-reads freshness (lines 96-99) after collection. The async path should do the same after a successful invoke.
Proposed fix: re-read freshness after async invoke
```diff
 	} else {
 		// Non-Lambda (HTTP) mode: collect synchronously. ClearCollectionStarted
 		// is called inside CollectRecommendations via the defer, so we don't
 		// need to call it here.
 		if _, collectErr := h.scheduler.CollectRecommendations(ctx); collectErr != nil {
 			return nil, fmt.Errorf("collection failed: %w", collectErr)
 		}
+	}
+
+	// Re-read freshness to capture the actual started_at / collected_at values.
+	freshness, err = h.config.GetRecommendationsFreshness(ctx)
+	if err != nil {
+		return nil, fmt.Errorf("failed to read freshness: %w", err)
+	}
-		// Re-read freshness after synchronous collect so the response reflects
-		// the actual collection time.
-		freshness, err = h.config.GetRecommendationsFreshness(ctx)
-		if err != nil {
-			return nil, fmt.Errorf("failed to read freshness after collect: %w", err)
-		}
-	}
 	// Re-read started_at (set by MarkCollectionStarted) for the 202 body.
-	// In the synchronous path this will be NULL (cleared by the defer), so we
-	// fall back to now as a reasonable sentinel.
+	// In the synchronous path this may be NULL (cleared by defer), so fall back
+	// to now as a reasonable sentinel.
 	now := time.Now().UTC()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/api/handler_recommendations_refresh.go` around lines 102 - 114, The
async branch currently uses a stale freshness snapshot read before calling
MarkCollectionStarted, so LastCollectionStartedAt is nil and startedAt falls
back to now; after a successful async invoke you should re-fetch freshness (same
way the sync path does) and use freshness.LastCollectionStartedAt if non-nil to
set StartedAt in the RefreshResponse (update the code around the async invoke to
call the freshness read function again and mirror the sync logic that assigns
startedAt, preserving use of MarkCollectionStarted and returning the refreshed
LastCollectedAt and StartedAt values).
```go
if schedulerARN != "" {
	// Atomic mark. If another request already marked it, that's fine —
	// we just return the current freshness.
	_, err := h.config.MarkCollectionStarted(ctx)
	if err != nil {
		return nil, fmt.Errorf("failed to mark cold-start collection: %w", err)
	}
	if invokeErr := h.asyncInvokeSelf(ctx, schedulerARN); invokeErr != nil {
		// Best-effort: undo the mark so the next request can try again.
		_ = h.config.ClearCollectionStarted(ctx)
		return nil, fmt.Errorf("failed to trigger cold-start collect: %w", invokeErr)
	}
	// Re-read freshness to return the started_at value.
	return h.config.GetRecommendationsFreshness(ctx)
}
```
Discarded ok from MarkCollectionStarted causes redundant invokes and unsafe rollback.
The boolean return indicating whether this caller won the race is discarded. When ok == false (another collection is already in-flight):

- `asyncInvokeSelf` is called anyway — wasteful/duplicate Lambda invocation.
- If the invoke fails, `ClearCollectionStarted` clears the other caller's in-flight marker, potentially allowing premature re-triggering.
The comment at line 173-174 acknowledges the scenario but the code doesn't act on it.
Proposed fix: respect the race result
```diff
 if schedulerARN != "" {
 	// Atomic mark. If another request already marked it, that's fine —
 	// we just return the current freshness.
-	_, err := h.config.MarkCollectionStarted(ctx)
+	ok, err := h.config.MarkCollectionStarted(ctx)
 	if err != nil {
 		return nil, fmt.Errorf("failed to mark cold-start collection: %w", err)
 	}
-	if invokeErr := h.asyncInvokeSelf(ctx, schedulerARN); invokeErr != nil {
-		// Best-effort: undo the mark so the next request can try again.
-		_ = h.config.ClearCollectionStarted(ctx)
-		return nil, fmt.Errorf("failed to trigger cold-start collect: %w", invokeErr)
+	if ok {
+		if invokeErr := h.asyncInvokeSelf(ctx, schedulerARN); invokeErr != nil {
+			// Best-effort: undo the mark so the next request can try again.
+			_ = h.config.ClearCollectionStarted(ctx)
+			return nil, fmt.Errorf("failed to trigger cold-start collect: %w", invokeErr)
+		}
 	}
 	// Re-read freshness to return the started_at value.
 	return h.config.GetRecommendationsFreshness(ctx)
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/api/handler_recommendations_refresh.go` around lines 172 - 186,
MarkCollectionStarted currently discards the boolean "ok" which indicates if
this caller won the race; capture its return (ok, err) from
h.config.MarkCollectionStarted(ctx) and if ok == false simply return
h.config.GetRecommendationsFreshness(ctx) without calling h.asyncInvokeSelf or
ClearCollectionStarted. Only if ok == true should you call
h.asyncInvokeSelf(ctx, schedulerARN); if that invoke fails then call
h.config.ClearCollectionStarted(ctx) to roll back and return the error. This
ensures you don't redundantly invoke asyncInvokeSelf or accidentally clear
another caller's in-flight marker; refer to MarkCollectionStarted,
asyncInvokeSelf, ClearCollectionStarted, and GetRecommendationsFreshness to
locate the changes.
```diff
 func (r *Router) refreshRecommendationsHandler(ctx context.Context, req *events.LambdaFunctionURLRequest, params map[string]string) (any, error) {
-	return r.h.scheduler.CollectRecommendations(ctx)
+	return r.h.postRefreshRecommendations(ctx, req)
 }
```
Refresh handler is permission-aware, but route-level auth still blocks non-admin users.
At Line 342, refresh now routes through postRefreshRecommendations, which performs requirePermission(view, recommendations). However, /api/recommendations/refresh is still registered with default AuthAdmin (Line 92), so non-admin users never reach this handler.
💡 Proposed fix
```diff
- {ExactPath: "/api/recommendations/refresh", Method: "POST", Handler: r.refreshRecommendationsHandler},
+ {ExactPath: "/api/recommendations/refresh", Method: "POST", Handler: r.refreshRecommendationsHandler, Auth: AuthUser},
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/api/router.go` around lines 341 - 342, The route for
/api/recommendations/refresh is still protected by AuthAdmin at registration, so
non-admins are blocked before reaching Router.refreshRecommendationsHandler
which calls h.postRefreshRecommendations (and performs requirePermission(view,
recommendations)); update the route registration to use a non-admin auth policy
(e.g., AuthUser or the equivalent) or remove the Admin-only guard so the
handler's permission check can run, ensuring the registration change references
the same route path and uses the AuthUser/AuthNone constant instead of
AuthAdmin.
```go
// Bookkeeping: always clears last_collection_started_at on exit (success or
// failure) so the frontend polling loop can detect completion. The scheduler
// is invoked either by the cron EventBridge rule or by an async self-invoke
// from the POST /api/recommendations/refresh handler. In the async case,
// MarkCollectionStarted has already set last_collection_started_at; the
// cron case leaves it NULL (no async-invoke bookkeeping for cron runs, which
// are expected and not user-triggered).
```
Unconditional clear can erase another in-flight run’s marker.
Because every scheduler run clears last_collection_started_at on exit, an overlapping cron invocation can clear the marker for a user-triggered async run before that run completes. That weakens the 409 guard and can end frontend polling early.
A robust fix is to move from plain timestamp clearing to ownership-aware clear (e.g., invocation token or compare-and-clear on expected started_at value).
Also applies to: 153-157, 254-257
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduler/scheduler.go` around lines 143 - 149, Current
unconditional clearing of last_collection_started_at can erase a concurrent
run’s marker; change to an ownership-aware clear by having MarkCollectionStarted
return an owner identifier (e.g., a UUID token or the started_at timestamp) and
then perform a compare-and-clear when exiting: only clear
last_collection_started_at if the stored value matches the returned owner
token/started_at. Update the scheduler exit/cleanup paths that currently clear
last_collection_started_at (the unconditional clear code around the exit logic
referenced plus the similar spots at the other mentioned locations) to use this
compare-and-clear check so overlapping cron and async runs don’t wipe each
other’s markers.
```go
// Always clear last_collection_started_at on exit so the frontend knows
// the collection has finished. Extracted into a helper to keep this
// function under the cyclomatic-complexity gate.
defer s.clearCollectionStartedBestEffort(ctx)
```
Deferred marker clear should not reuse the request context.
At Line 156, the deferred cleanup uses ctx. If collection ends due to a deadline or cancellation, Line 255 executes with a canceled context and fails immediately, leaving last_collection_started_at stuck until TTL recovery.
💡 Proposed fix
```diff
-	defer s.clearCollectionStartedBestEffort(ctx)
+	defer s.clearCollectionStartedBestEffort()

-func (s *Scheduler) clearCollectionStartedBestEffort(ctx context.Context) {
-	if err := s.config.ClearCollectionStarted(ctx); err != nil {
+func (s *Scheduler) clearCollectionStartedBestEffort() {
+	clearCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+	defer cancel()
+	if err := s.config.ClearCollectionStarted(clearCtx); err != nil {
 		logging.Errorf("failed to clear collection started: %v", err)
 	}
 }
```
Also applies to: 254-257
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduler/scheduler.go` around lines 153 - 157, Deferred cleanup
currently captures and reuses the incoming request context (ctx) in the defer
call to s.clearCollectionStartedBestEffort, which can be canceled before defer
runs and prevents clearing last_collection_started_at; change the defer to
capture a non-cancelable context (e.g. context.Background() or a created context
with its own timeout) and pass that into s.clearCollectionStartedBestEffort so
the cleanup will run even if the original ctx is canceled — update the defer
statement where s.clearCollectionStartedBestEffort is registered (and any
similar defer sites around that block) to use the new background/independent
context instead of ctx.
Folds the previously-out-of-scope follow-ups into PR #257 itself so deployed verification doesn't require a second PR:

Terraform (compute/aws/lambda):
- new aws_iam_role_policy.async_self_invoke grants lambda:InvokeFunction scoped to the API Lambda's own ARN, so the async-refresh handler can fire-and-forget self-invoke the scheduler-task code path with InvocationType=Event
- new SCHEDULER_LAMBDA_ARN env var derived from aws_partition + aws_caller_identity + var.region + var.stack_name so the variable is the API Lambda's own ARN without a self-reference cycle on aws_lambda_function.main.arn
- data sources aws_partition.current and aws_caller_identity.current added at the top of the module

Frontend (api/recommendations.ts + freshness.ts):
- RefreshRecommendationsResult: started_at + last_collected_at (matches the new RefreshResponse from the backend handler). Pre-#257 fields (recommendations / total_savings / providers) kept as optional during the transition so legacy fixtures still type-check.
- RecommendationsFreshness: new optional last_collection_started_at so polling can detect when the in-flight async collection has finished (the backend clears it on both success and failure paths).
- freshness.ts refresh-button click handler now polls /recommendations/freshness every 5s after the POST returns, waiting for last_collection_started_at to clear (or for last_collected_at to advance past started_at). 10-min safety cap so a stuck collection never wedges the UI. Optional started_at in the response degrades gracefully to "synchronous, exit immediately" so pre-#257 mocks still pass.

Tests: 1,453 frontend tests pass; existing freshness test suite unmodified and green.
@coderabbitai full review

I folded the originally-out-of-scope follow-ups into this PR: 3,614 Go tests + 1,453 frontend tests + all pre-commit hooks (terraform fmt/validate/tflint, frontend build, etc.) green locally.
Solid test coverage numbers — will review all the changes thoroughly.
…sh (#257)

Five actionable + one nitpick from CR review of e0f647a..170b18e:

1. handler_recommendations_refresh.go::triggerColdStartCollect now respects the bool from MarkCollectionStarted: ok=false means another caller already won the race; we return current freshness without a second async-invoke and without ClearCollectionStarted (which would wipe the other caller's marker).
2. postRefreshRecommendations re-reads freshness AFTER the async invoke so LastCollectionStartedAt reflects the value MarkCollectionStarted just wrote. Previously the freshness snapshot was taken before the mark, so StartedAt always fell back to time.Now() instead of the persisted value. The collect-trigger logic was extracted into runMarkedCollection so the parent stays under the cyclomatic-complexity gate.
3. router.go: /api/recommendations/refresh now uses Auth: AuthUser (was the AuthAdmin default). The handler still enforces requirePermission(view, recommendations) so view-only roles can refresh data they're allowed to see; admin-only would have blocked them at the router layer.
4. scheduler.go: defer-clear of last_collection_started_at now uses a fresh context.Background() with a 5s timeout. The request ctx could already be canceled by the time the defer runs (slow collect outliving caller deadline), which would leave the marker in place until the 5-min auto-recovery window kicked in.
5. handler_test.go: TestHandler_HandleRequest_RefreshRecommendations now pins the synchronous-fallback branch via t.Setenv() and asserts a single 200 (was assert.Contains([200, 202])). Async-branch coverage moves to handler-level integration tests that inject lambdaInvoker directly.

CR finding #6 (compare-and-clear with owner token to avoid wiping a concurrent run's marker) deliberately deferred — the existing 5-min auto-recovery window already handles overlapping runs and the owner-token refactor is a larger architectural change. Filed as follow-up in the PR thread.
CodeRabbit pass-1 response

Fixed (pushed in 49bc90d):
Dismissed with justification:
@coderabbitai full review
Accepted fixes (49bc90d):
Dismissed (owner token / compare-and-clear): The 5-minute auto-recovery window + the narrow practical overlap window (<60 s given cron cadence) is a reasonable justification for deferring the …

Kicking off a fresh full review now.
…up) (#265)

The async refresh path landed in PR #260 but the Lambda self-invoke payload didn't match what the receiving Lambda expects. The shipped payload was:

    {"event": "scheduled_recommendations"}

But internal/server/handler.go::ScheduledEvent has no "event" field — it has Action, Source, DetailType, Detail. Unmarshalling that JSON gives Action="", and ParseScheduledEvent rejects every event with:

    unknown scheduled task action: ""

Verified live in /aws/lambda/cudly-dev-426fc8af-api request 3befb380-6d46-47e1-954b-69a2b59ea90a:

    16:49:40 Received unknown event (size: 37 bytes)
    16:49:40 Unknown event type, treating as scheduled event
    16:49:40 {"errorMessage":"failed to parse scheduled event: unknown scheduled task action: \"\""}

The 37-byte event = the {"event":"scheduled_recommendations"} string exactly. The async self-invoke is firing successfully but every invocation is rejected, so the "Refreshing..." banner never clears and collection never runs — same user-visible symptom (refresh times out + no fresh data) as if the env var or IAM grant was missing.

Fix: send the payload shape the dispatcher actually accepts:

    {"source": "aws.events", "action": "collect_recommendations"}

action="collect_recommendations" maps to TaskCollectRecommendations in the ParseScheduledEvent switch. source="aws.events" makes detectLambdaEventType in internal/server/lambda.go classify the invoke consistently with EventBridge cron deliveries that already exercise this code path.
#267 #268) (#269)

* perf(aws): parallelize 5 sequential service calls in GetAllRecommendations (closes #266)

The dispatcher in providers/aws/recommendations/client.go::GetAllRecommendations ran 5 service calls (EC2 / RDS / ElastiCache / OpenSearch / Redshift) one after another with an artificial 100ms stagger between each. Each per-service call hits AWS Cost Explorer's GetReservationPurchaseRecommendation, which routinely takes multiple seconds. Total wall time per account scaled as sum(per-service) instead of max(per-service).

Mirrors the Azure parallelisation in providers/azure/recommendations.go (closes #258, commit b10326c). The five calls have no inter-service dependency, so they can run concurrently under errgroup.WithContext. Each goroutine captures its own error in a closure-scoped variable and returns nil to the group so one service's failure does not cancel siblings (matching the previous loop's `continue`-on-error tolerance). Results are merged in the canonical order EC2 → RDS → ElastiCache → OpenSearch → Redshift after Wait() so any order-sensitive callers stay stable.

After Wait(), ctx.Err() is checked and propagated so a canceled parent ctx surfaces as context.Canceled / context.DeadlineExceeded rather than being swallowed by the per-service error-isolation goroutines (which all return nil).

The 100ms artificial stagger is removed — it was a sequential-mode rate-limit hack that adds 400ms per account for no benefit under concurrent fan-out (Cost Explorer rate limits apply per request, not per consecutive request).

Behaviour change vs the previous sequential loop: per-service errors are now logged at WARN via the new mergeServiceResults helper. The previous loop swallowed them silently with a bare `continue`, leaving operators no signal when a single service was misbehaving.

The serviceResult struct + mergeServiceResults helper extraction also keeps GetAllRecommendations under the project's gocyclo gate (.golangci.yml min-complexity: 15).
New test TestGetAllRecommendations_PropagatesContextCancellation pins the contract: calling GetAllRecommendations with a pre-cancelled context must return context.Canceled. Mirrors the Azure equivalent at providers/azure/recommendations_test.go::TestRecommendationsClientAdapter_GetRecommendations_PropagatesContextCancellation.

Expected end-to-end after this change: ~max(individual call durations) instead of sum + 4×100ms stagger.

* perf(gcp): parallelize region × service nested loops in GetRecommendations (closes #267)

The dispatcher in providers/gcp/recommendations.go::GetRecommendations was doubly serial — `for region { for service { ... } }` — fetching Compute Engine then Cloud SQL recommendations one region at a time. With ~30+ GCP regions × 2 services, that's 60+ sequential RPCs per account against the project-scoped Recommender API. Even at 200ms per call this is a 12s+ floor; in practice most calls are slower. After #260 (async self-invoke) this no longer triggers user-facing 502s, but it directly inflates the scheduler Lambda's runtime and the in-flight freshness-banner window.

Two-level concurrent fan-out, mirroring the Azure pattern in providers/azure/recommendations.go (closes #258, commit b10326c) and the AWS service-loop parallelisation (closes #266):

- Outer: errgroup over regions, capped at gcpRegionConcurrency() (CUDLY_GCP_REGION_PARALLELISM, default 10) to stay polite to the project-scoped Recommender API quota. Lower than the existing account-level CUDLY_MAX_ACCOUNT_PARALLELISM (default 20) because each account already gets its own goroutine slot from fanOutPerAccount; cumulative concurrency would multiply otherwise.
- Inner: within each region's goroutine, Compute + Cloud SQL run as two further goroutines under a per-region sub-errgroup, so per-region cost is max(compute, sql) rather than compute + sql.
Each goroutine returns nil to its errgroup so a single per-(region, service) failure does not cancel siblings — preserves the previous silent-skip-on-err shape semantically.

Behaviour change vs the previous nested for-loops: per-(region, service) errors that were silently swallowed by the previous `if err == nil { ... }` shape are now logged at WARN with the region+service tag so misconfigured projects are diagnosable. Errors still don't propagate to callers (preserving silent-skip); only the log severity changes from "invisible" to WARN.

After the outer Wait(), ctx.Err() is checked and propagated so a canceled parent ctx surfaces as context.Canceled / context.DeadlineExceeded rather than being swallowed by the per-region error-isolation goroutines (which all return nil).

Result merging walks regions in sorted order and appends compute then sql per region — output is deterministic independent of GCP API region-list ordering or goroutine completion order.

The collectRegion helper extraction also keeps GetRecommendations under the project's gocyclo gate (.golangci.yml min-complexity: 15) after the post-Wait ctx.Err() block was added.

Tests added:

- TestRecommendationsClientAdapter_GetRecommendations_PropagatesContextCancellation pins the contract that a pre-cancelled ctx surfaces as context.Canceled.
- TestGCPRegionConcurrency pins the env-knob parser semantics (unset/positive/non-numeric/zero/negative/explicit-unset).

Expected end-to-end after this change: ceil(N_regions / cap) × max(compute, sql) instead of N_regions × (compute + sql) — for a project with 30 regions and a cap of 10, that's a ~6× speedup before counting the per-region service-level parallelism win.

* perf(scheduler): parallelize provider collection loop AWS/Azure/GCP (closes #268)

The provider loop in scheduler.go::CollectRecommendations ran AWS, Azure and GCP collection sequentially: total wall time = sum(per-provider) rather than max(per-provider).
After #260 (async self-invoke) the user-facing 502 on POST /api/recommendations/refresh is gone, but the scheduler Lambda itself still pays the serial cost — visible as a longer "collection in flight" freshness-banner window and reduced cron-tick headroom.

Provider-level fan-out under errgroup, mirroring the Azure and AWS service-loop parallelisations (#258 / #266). Each per-provider goroutine returns nil to the group so a single provider's error does not cancel its siblings — preserves the previous loop's `continue`-on-error semantics. Per-provider results are written into a map under a single mutex (the same pattern as fanOutPerAccount at scheduler.go:393).

After Wait, ctx.Err() is checked and propagated so a canceled parent ctx surfaces as context.Canceled / context.DeadlineExceeded rather than being swallowed by the per-provider error-isolation goroutines (which all return nil).

Deterministic merge: walk EnabledProviders in config order to assemble successfulProviders / successfulCollects / allRecommendations / totalSavings / failedProviders. Order is by config, NOT by goroutine completion order or by Go's randomised map iteration — keeps existing tests stable and the freshness-banner / notification-email content predictable.

The new collectAllProviders helper extraction also keeps CollectRecommendations under the project's gocyclo gate (.golangci.yml min-complexity: 15) after the errgroup + post-Wait ctx.Err() block was added.

No concurrency cap on the outer fan-out — the universe is at most 3 providers; the per-account fan-out inside each provider is still bounded by CUDLY_MAX_ACCOUNT_PARALLELISM (default 20) and per-region GCP fan-out by CUDLY_GCP_REGION_PARALLELISM (default 10, from #267).

Test mocks updated: per-provider mock methods (CreateAndValidateProvider, ListCloudAccounts, GetRecommendationsClient, GetAllRecommendations) are now invoked from inside the errgroup goroutine, so they receive the errgroup-derived gctx rather than the caller's ctx.
Mock setups changed from literal ctx to mock.Anything where the call sits inside the goroutine path; calls made before/after the fan-out (GetGlobalConfig, SendNewRecommendationsNotification, persistCollection writes) still pin literal ctx because they run on the caller's context unchanged.

New TestScheduler_CollectRecommendations_ParallelProviders pins three contracts:

1. successfulProviders ordering matches EnabledProviders config order, not goroutine completion order or alphabetical. Asserted with a deliberately non-alphabetical config (["gcp", "aws", "azure"]) where AWS fails (ambient credentials → factory mock returns error) and Azure/GCP succeed empty (no enabled accounts) — the resulting successfulProviders must be ["gcp", "azure"], distinguishable from any of: input order, alphabetical, or arbitrary map iteration.
2. ctx cancellation propagates: a pre-cancelled ctx surfaces as context.Canceled (not a "successful but empty" CollectResult).

Expected wall-time impact: CollectRecommendations drops from sum(per-provider) to max(per-provider). With AWS+Azure+GCP all enabled and Azure typically the slowest at ~15-20s post-#258, total drops from ~30-45s to ~max(15-20s).
Summary
Async recommendation refresh end-to-end (closes #257). The 502 on user-triggered refresh and on first-load was caused by the API Lambda doing the full provider fan-out synchronously and exceeding its 60s timeout — Azure alone takes >60s (#258 is the orthogonal parallelization quick-win).
This PR splits the request path so the user-facing API returns immediately. POST /api/recommendations/refresh:

1. requirePermission(view, recommendations)
2. read freshness for the response body
3. atomically MarkCollectionStarted (returns false → 409 if another collection is already in flight, with a 5-minute auto-recovery window in case the scheduler crashed mid-run and never cleared)
4. async-invoke the scheduler Lambda via lambda:InvokeFunction with InvocationType=Event (fire-and-forget) when SCHEDULER_LAMBDA_ARN is set; otherwise fall back to synchronous CollectRecommendations so local/HTTP-mode dev still works
5. return 202 with started_at + previous last_collected_at so the frontend can render "Last refreshed N min ago — collecting…"
The same async-invoke path is wired into the cold-start branch of GET /api/recommendations so first-load no longer blocks on a 60s provider fan-out.

Bookkeeping additions
StoreInterface (Postgres + 5 test mocks across analytics/api/purchase/scheduler/server packages):

MarkCollectionStarted(ctx) (bool, error)   // race-safe set
ClearCollectionStarted(ctx) error          // called by scheduler at end

Migration 000047 adds last_collection_started_at on recommendations_state (default NULL).

Folded-in follow-ups (originally listed as out-of-scope)
After landing the backend, I folded in the deploy + UX pieces in a second commit so the PR is end-to-end functional and verifiable in deployed mode:
Terraform (terraform/modules/compute/aws/lambda/main.tf):

- aws_iam_role_policy.async_self_invoke — grants lambda:InvokeFunction scoped to the API Lambda's own ARN
- SCHEDULER_LAMBDA_ARN env var derived from aws_partition + aws_caller_identity + var.region + var.stack_name (avoids self-reference cycle on aws_lambda_function.main.arn)
- aws_partition.current + aws_caller_identity.current data sources

Frontend:
- RefreshRecommendationsResult and RecommendationsFreshness types updated for the new payload shapes (legacy fields kept optional during transition so existing fixtures still type-check)
- freshness.ts refresh-click handler now polls /recommendations/freshness every 5 s after the POST returns, waiting for last_collection_started_at to clear (or for last_collected_at to advance past started_at). 10-min safety cap.

Test plan
- go vet ./... clean
- go test ./internal/... — 3,614 tests pass
- npm test (frontend) — 1,453 tests pass
- tsc --noEmit clean

Closes #257.