Skip to content

fix(aws): fetch every Cost Explorer (term, payment) combo (#188)#195

Merged
cristim merged 6 commits into
feat/multicloud-web-frontendfrom
fix/issue-188-aws-1yr-recs
Apr 29, 2026
Merged

fix(aws): fetch every Cost Explorer (term, payment) combo (#188)#195
cristim merged 6 commits into
feat/multicloud-web-frontendfrom
fix/issue-188-aws-1yr-recs

Conversation

@cristim
Copy link
Copy Markdown
Member

@cristim cristim commented Apr 29, 2026

Summary

Closes #188. Two-commit fix:

  1. fix(aws): fetch both 1yr and 3yr Cost Explorer terms (#188) — root cause for the filed bug.
  2. feat(aws): also fan out across all 3 Cost Explorer payment options (#188) — same root cause shape, different parameter; landed on user request so the UI's payment-option filter is non-empty.

Why two changes in one PR

PR #189 made recommendation IDs unique enough to keep variants from colliding downstream — but the AWS client was only ever asking Cost Explorer for one specific (term, payment) cell. GetReservationPurchaseRecommendation requires both TermInYears and PaymentOption per request and returns recs only for that single (term, payment) — there is no "give me all of them" mode. The old call hardcoded ("3yr", "partial-upfront"), so every other combo simply never reached the scheduler regardless of how unique the IDs were.

Symptom #188 (only 3-year recs visible) is the term axis. The payment axis has the same shape, with the same hardcoded-default origin and the same downstream wiring already in place.

Fix

GetRecommendationsForService now iterates the Cartesian product of two package constants:

defaultDiscoveryTerms          = ["1yr", "3yr"]
defaultDiscoveryPaymentOptions = ["all-upfront", "partial-upfront", "no-upfront"]

→ 6 Cost Explorer calls per service, 5 services = 30 calls per refresh.

Per-call CE errors are tolerated and skipped (matches GetAllRecommendations's per-service tolerance); only an error where every (term, payment) combo fails is propagated.

Downstream wiring is already correct

After PR #189 + migration 000042:

  • parser_ri.go:42 tags rec.Term = params.Term and rec.PaymentOption = params.PaymentOption.
  • scheduler.go:856-858 builds the rec ID from (provider | account | service | region | resource_type | engine | term | payment_option).
  • recommendations_natural_key_idx (migration 000042) is the matching 8-tuple unique index.
  • The frontend column-filter already exposes both term and payment-option as facets.

So all 6 variants land as distinct DB rows and render as distinct UI rows for free.

Test plan

  • TestGetRecommendationsForService: now expects 6 recs and asserts the full {1yr,3yr} × {all-upfront,partial-upfront,no-upfront} combo set is present.
  • TestGetRecommendationsForService_QueriesEveryCombo (new — replaces the term-only _QueriesBothTerms regression): captures the per-call CE input slice via a new riCalls field on the mock and asserts every one of the 6 (term, payment) pairs was requested. Pins the regression at the API-call layer so a future change that quietly drops a combo fails the test even if the surviving recs parse correctly.
  • go test ./... clean from repo root and providers/aws/.
  • Pre-commit hooks (gosec, go vet, gomod tidy, …) clean on both commits.
  • Manual UI verification (post-merge): pick an account with known recs in Cost Explorer; refresh the Recommendations page; confirm both 1 Year and 3 Years show in the term filter and all three of all-upfront/partial-upfront/no-upfront show in the payment-option filter.

Cost / cadence note

Recommendation collection runs at the daily 2 AM cron (recommendation_schedule = "0 2 * * *") plus an opportunistic refresh capped to once per 6h when users hit the page (defaultCacheTTL = 6 * time.Hour, scheduler.go:99). At ~5 refreshes/day worst case:

  • Old: 5 services × 1 combo × 5 refreshes/day ≈ 750 CE requests/month/account — just inside the 1,000-request free tier.
  • New: 5 services × 6 combos × 5 refreshes/day ≈ 4,500 CE requests/month/account — ~3,500 above free tier, ~$35/month/account at $0.01/request.

Trade-off accepted: the user-visible "every term × payment is selectable" win is worth ~$35/account/month for active accounts; idle accounts pay nothing extra (no opportunistic refreshes fire).

🤖 Generated with claude-flow

Summary by CodeRabbit

  • New Features

    • AWS recommendations now aggregate results across all supported reservation term and payment option combinations for broader coverage.
  • Bug Fixes

    • Improved resilience: continues when individual provider queries fail and reports an error only if all variants fail; aborts immediately on cancellation or timeout.
  • Tests

    • Added tests to verify all term/payment combinations are requested and that cancellation short-circuits before further calls.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 29, 2026

Warning

Rate limit exceeded

@cristim has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 8 minutes and 27 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3913c5d7-29f7-4569-b17b-ada9bd433c3f

📥 Commits

Reviewing files that changed from the base of the PR and between fab78c7 and 00a2835.

📒 Files selected for processing (4)
  • internal/database/postgres/migrations/000043_recommendations_add_engine_to_key.down.sql
  • internal/database/postgres/migrations/000043_recommendations_add_engine_to_key.up.sql
  • providers/aws/recommendations/client.go
  • providers/aws/recommendations/client_test.go

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

GetRecommendationsForService now iterates the Cartesian product of discovery terms and payment options, making one Cost Explorer call per (term,payment) pair and aggregating labeled results. Per-combination errors are tolerated; context cancellation short-circuits immediately. GetAllRecommendations propagates context cancellation and stops further processing.

Changes

Cohort / File(s) Summary
AWS recommendations client
providers/aws/recommendations/client.go
Replaces the single Cost Explorer request with iteration over defaultDiscoveryTerms × defaultDiscoveryPaymentOptions, issuing up to 6 calls per service, aggregating recommendations, labeling them with Term/PaymentOption, tolerating per-combination errors, and short-circuiting on ctx.Err().
Tests for AWS recommendations
providers/aws/recommendations/client_test.go
Updates mock to record each GetReservationPurchaseRecommendation input; updates expectations to the full 2×3 Cartesian product (6 calls/results); adds a test verifying exactly 6 requests and a test that a canceled context returns context.Canceled without making API calls.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant AWSClient
  participant CostExplorer
  participant Aggregator

  Caller->>AWSClient: GetRecommendationsForService(ctx, service)
  AWSClient->>AWSClient: for each term ∈ Terms\n for each payment ∈ Payments
  AWSClient->>CostExplorer: GetReservationPurchaseRecommendation(term,payment)
  alt success
    CostExplorer-->>AWSClient: recommendations
    AWSClient->>Aggregator: append labeled recommendations
  else api error
    CostExplorer-->>AWSClient: error
    AWSClient->>AWSClient: record error, continue
  end
  AWSClient->>AWSClient: if ctx canceled -> return ctx.Err()
  AWSClient-->>Caller: aggregated recommendations (or error if all failed)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

effort/m

Poem

I’m a rabbit in the code, hopping through each pair,
Terms and payments multiply — I fetch them everywhere.
Six tiny recs I gather, labeled tidy and bright,
If the context says “halt!” — I pause mid-flight.
🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: fetching every Cost Explorer (term, payment) combination instead of just one hardcoded pair.
Linked Issues check ✅ Passed The PR addresses all coding requirements from issue #188: includes Term and PaymentOption in recommendation identity, fetches both 1yr and 3yr terms, handles multiple payment options, implements context cancellation short-circuiting, and includes comprehensive tests.
Out of Scope Changes check ✅ Passed All changes are directly related to the objective of fetching Cost Explorer recommendations across multiple term and payment-option combinations; no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/issue-188-aws-1yr-recs

Review rate limit: 0/5 reviews remaining, refill in 8 minutes and 27 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 29, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 29, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
providers/aws/recommendations/client_test.go (1)

344-374: Add a mixed outcome test (one term fails, one succeeds).

The production logic now tolerates per-term failures; add a test that verifies successful term results are returned when the other term errors, and a separate case where both fail returns error.

🧪 Suggested additional test shape
+func TestGetRecommendationsForService_OneTermFails_OtherSucceeds(t *testing.T) {
+	// Configure mock to fail first RI call and succeed second.
+	// Assert: err == nil, returned recs include exactly one term's results.
+}
+
+func TestGetRecommendationsForService_AllTermsFail_ReturnsError(t *testing.T) {
+	// Configure mock to fail both RI calls.
+	// Assert: err != nil, recs == nil.
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@providers/aws/recommendations/client_test.go` around lines 344 - 374, Add two
new tests alongside TestGetRecommendationsForService_QueriesBothTerms: one where
mockCostExplorerAPI is configured to return an error for one TermInYears and a
successful GetReservationPurchaseRecommendationOutput for the other, asserting
that client.GetRecommendationsForService returns the successful recommendations
and no error; and another where the mock returns errors for both terms and
assert that client.GetRecommendationsForService returns an error. Reuse
NewClientWithAPI, mockCostExplorerAPI, and inspect mockAPI.riCalls and
mockAPI.riRecommendations (or riErrors) to simulate per-term outcomes and verify
the client tolerates a single-term failure but fails when both terms error.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@providers/aws/recommendations/client.go`:
- Around line 123-133: The loop that calls c.GetRecommendations treats all
errors as skippable and only returns an error if every term failed; update the
handling around the c.GetRecommendations(ctx, params) call so that if err
indicates context cancellation/timeout (errors.Is(err, context.Canceled) or
errors.Is(err, context.DeadlineExceeded) or ctx.Err() != nil) you immediately
return nil, err (propagate the canceled/expired context) instead of continuing,
otherwise continue to collect successes as before; reference the
c.GetRecommendations call, lastErr and successCount logic and ensure the final
all-term failure message for service remains unchanged for non-context errors.

---

Nitpick comments:
In `@providers/aws/recommendations/client_test.go`:
- Around line 344-374: Add two new tests alongside
TestGetRecommendationsForService_QueriesBothTerms: one where mockCostExplorerAPI
is configured to return an error for one TermInYears and a successful
GetReservationPurchaseRecommendationOutput for the other, asserting that
client.GetRecommendationsForService returns the successful recommendations and
no error; and another where the mock returns errors for both terms and assert
that client.GetRecommendationsForService returns an error. Reuse
NewClientWithAPI, mockCostExplorerAPI, and inspect mockAPI.riCalls and
mockAPI.riRecommendations (or riErrors) to simulate per-term outcomes and verify
the client tolerates a single-term failure but fails when both terms error.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 968ee110-d979-4d03-b7be-4d830088a376

📥 Commits

Reviewing files that changed from the base of the PR and between b376822 and 0bb7d94.

📒 Files selected for processing (2)
  • providers/aws/recommendations/client.go
  • providers/aws/recommendations/client_test.go

Comment thread providers/aws/recommendations/client.go Outdated
@cristim cristim changed the title fix(aws): fetch both 1yr and 3yr Cost Explorer terms (#188) fix(aws): fetch every Cost Explorer (term, payment) combo (#188) Apr 29, 2026
@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 29, 2026

Updated: extended to also fan out across the three payment options (all-upfront / partial-upfront / no-upfront) in addition to the term loop, so the UI's payment-option filter is no longer empty. Same root cause shape, same downstream wiring (already correct after #189 + migration 000042). Cost note in the PR body — corrected the per-refresh-cadence math after re-checking the actual cron (0 2 * * *) and 6h opportunistic-refresh TTL.

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 29, 2026

@cristim, understood — the extended fan-out across payment options (all-upfront / partial-upfront / no-upfront) on top of the term loop is a natural follow-up to the same root-cause fix. Triggering a fresh review of the updated changes now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
providers/aws/recommendations/client.go (1)

145-147: ⚠️ Potential issue | 🟠 Major

Propagate canceled/expired context immediately instead of continuing.

On Line 145, canceled/deadline errors are treated like skippable combo failures. If earlier combos succeeded, this can return partial data with nil error after cancellation. Please short-circuit on context termination.

Suggested fix
 			recs, err := c.GetRecommendations(ctx, params)
 			if err != nil {
+				if ctx.Err() != nil {
+					return nil, ctx.Err()
+				}
 				lastErr = err
 				continue
 			}

Also applies to: 153-156

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@providers/aws/recommendations/client.go` around lines 145 - 147, Inside the
loop that currently does "if err != nil { lastErr = err; continue }" (the block
that assigns to lastErr), do not treat context cancellation or deadline errors
as skippable: detect context cancellation by checking errors.Is(err,
context.Canceled) || errors.Is(err, context.DeadlineExceeded) (or ctx.Err() !=
nil) and immediately return that error (short-circuit) instead of setting
lastErr and continuing; for non-context errors keep the existing behavior (set
lastErr and continue). Apply the same change to the similar error-handling block
that currently appears around lines 153-156.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@providers/aws/recommendations/client.go`:
- Around line 145-147: Inside the loop that currently does "if err != nil {
lastErr = err; continue }" (the block that assigns to lastErr), do not treat
context cancellation or deadline errors as skippable: detect context
cancellation by checking errors.Is(err, context.Canceled) || errors.Is(err,
context.DeadlineExceeded) (or ctx.Err() != nil) and immediately return that
error (short-circuit) instead of setting lastErr and continuing; for non-context
errors keep the existing behavior (set lastErr and continue). Apply the same
change to the similar error-handling block that currently appears around lines
153-156.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eaa9ac8b-7527-4b82-8894-5ad808f853f0

📥 Commits

Reviewing files that changed from the base of the PR and between 0bb7d94 and 6dc6185.

📒 Files selected for processing (2)
  • providers/aws/recommendations/client.go
  • providers/aws/recommendations/client_test.go

cristim added a commit that referenced this pull request Apr 29, 2026
Addresses CodeRabbit's actionable finding on PR #195: the
(term, payment) loop in GetRecommendationsForService was treating a
canceled / deadline-exceeded ctx as a per-combo failure to be
tolerated. If earlier combos succeeded before cancellation, the
function would return partial data with a nil error; if all combos
ran after cancellation, it would burn 6× the wasted Cost Explorer
attempts and aggregate them into "all variants failed", masking
the real cancellation reason.

Fix: after each c.GetRecommendations(ctx, params) error, check
ctx.Err() first. If non-nil, return immediately so the caller sees
the ctx error verbatim. Per-combo errors (throttle, 5xx) keep the
existing skip-and-continue tolerance.

Same pattern applied to GetAllRecommendations's per-service block,
which had the identical risk: a canceled ctx during one service's
fan-out would silently skip to the next service rather than
short-circuiting.

Test:
  TestGetRecommendationsForService_ContextCancelShortCircuits —
  pre-cancel a ctx, mock CE to return context.Canceled on every
  call, assert the function (a) returns context.Canceled verbatim
  via errors.Is and (b) issues at most one CE call before bailing
  (rather than marching through all 6 combos).
@cristim cristim added bug Something isn't working triaged Item has been triaged priority/p1 Next up; this sprint severity/high Significant harm urgency/this-sprint Within the current sprint impact/all-users Affects every user impact/many Affects most users effort/s Hours type/bug Defect labels Apr 29, 2026
@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 29, 2026

Addressed CodeRabbit's actionable finding: the (term, payment) loop now short-circuits on ctx.Err() != nil (canceled / deadline-exceeded) instead of treating it as a per-combo failure. Same pattern applied to GetAllRecommendations's per-service block — same risk shape. New regression test TestGetRecommendationsForService_ContextCancelShortCircuits pre-cancels the ctx and asserts (a) the function returns context.Canceled verbatim via errors.Is and (b) at most one Cost Explorer call was attempted (no fan-out-then-aggregate after cancellation).

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 29, 2026

@cristim, the context-cancellation short-circuit is a solid fix — propagating ctx.Err() immediately rather than aggregating it as a per-combo failure prevents the fan-out from continuing on a cancelled/deadline-exceeded context, which would otherwise waste CE quota and delay error surfacing. Applying the same pattern to GetAllRecommendations is the right call given the identical risk shape. The regression test covering both the errors.Is(err, context.Canceled) assertion and the at-most-one-call bound looks thorough. Triggering a fresh review now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@providers/aws/recommendations/client_test.go`:
- Around line 414-421: The test currently uses require.ErrorIs which allows
wrapped errors and only checks riCalls <= 1; change the assertion to require or
assert that err equals context.Canceled verbatim (e.g., assert.Equal(t,
context.Canceled, err)) to ensure GetRecommendationsForService returns the exact
ctx.Err(); additionally, strengthen the loop-cancellation check by asserting the
RateLimiter was not invoked (check mockAPI.riCalls == 0) to ensure cancellation
short-circuits before any Cost Explorer/RateLimiter work (referencing
GetRecommendationsForService, mockAPI.riCalls and RateLimiter.Wait).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 61634598-a8b3-4fda-8307-8ed47068b9d1

📥 Commits

Reviewing files that changed from the base of the PR and between 6dc6185 and 4df421d.

📒 Files selected for processing (2)
  • providers/aws/recommendations/client.go
  • providers/aws/recommendations/client_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • providers/aws/recommendations/client.go

Comment thread providers/aws/recommendations/client_test.go Outdated
@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 29, 2026

Addressed the latest CodeRabbit pass in commit fab78c7: pre-canceled contexts now short-circuit before the Cost Explorer request path, and the regression test asserts the exact context.Canceled error plus zero RI calls.\n\n@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 29, 2026

@cristim, understood — moving the ctx.Err() check to before the Cost Explorer request path is a clean improvement: a pre-cancelled context now guarantees zero CE calls rather than tolerating up to one, and tightening the regression assertion to match makes the test more precise and self-documenting. Triggering a fresh review now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

cristim added 4 commits April 29, 2026 23:23
Closes #188.

PR #189 made recommendation IDs unique enough to keep 1yr and 3yr
variants from colliding downstream — but the AWS client never asked
Cost Explorer for 1yr recs in the first place. The discovery flow
hardcoded `Term: "3yr"` in providers/aws/recommendations/client.go,
and Cost Explorer's GetReservationPurchaseRecommendation requires a
`TermInYears` enum on each request and returns recs only for that
specific term — there is no "give me both" mode.

So even after #189, the user still saw only 3-year-term AWS recs:
the upstream call simply never fetched the 1yr variants.

Fix: GetRecommendationsForService now iterates over a small package
constant `defaultDiscoveryTerms = ["1yr", "3yr"]` and concatenates the
returned recs. Per-term Cost Explorer errors are tolerated and skipped
(matching GetAllRecommendations's per-service tolerance) so a single
throttle on one term doesn't suppress the other; only an error where
*every* term fails is propagated.

Downstream: the parser already tags `rec.Term = params.Term`, the
scheduler ID encoding from #189 already includes term, and the DB
natural-key index from migration 000042 already includes term — no
further changes needed for the two recs to land in distinct DB rows
and render as distinct rows in the UI.

Tests:
  - TestGetRecommendationsForService: now expects 2 recs (one per
    term), with terms ["1yr", "3yr"] present.
  - TestGetRecommendationsForService_QueriesBothTerms (new):
    captures the per-call CostExplorer input slice via a new
    `riCalls` field on the mock and asserts both
    types.TermInYearsOneYear and types.TermInYearsThreeYears were
    requested. This pins the regression at the API-call layer so a
    future change that quietly drops one term fails the test even
    if the parser tagging happens to produce two recs for other
    reasons.
  - TestGetAllRecommendations / TestGetAllRecommendations_SomeServicesFail:
    unchanged assertions still hold (5 services × 2 terms = 10
    parsed-and-warned attempts; the test only asserts NotEmpty).

Verified `go test ./...` passes from the repo root and from the
providers/aws submodule.

API-call cost: doubles the Cost Explorer call count for the
recommendations refresh (5 services × 2 terms = 10 calls vs 5
previously). The existing rate limiter and 100ms inter-service
sleep handle this; refresh runs at the ~15 min cadence already
budgeted for.
)

Builds on the term loop landed in the prior commit. Same root cause
(Cost Explorer requires the parameter on each request and returns recs
only for that single value), same fix shape (replace the hardcoded
default with a small package-level constant slice and iterate). The
prior commit covered TermInYears; this one covers PaymentOption.

Before:
  GetRecommendationsForService → 1 call per service at the hardcoded
  ("3yr", "partial-upfront") cell. After PR #189's ID-uniqueness fix
  the absent variants couldn't even surface from the cache; from the
  user's perspective the "all upfront" / "no upfront" filter values
  were always empty.

After:
  GetRecommendationsForService iterates the Cartesian product of
    defaultDiscoveryTerms        = ["1yr", "3yr"]
    defaultDiscoveryPaymentOptions = ["all-upfront", "partial-upfront", "no-upfront"]
  → 6 calls per service. Per-call CE errors are tolerated and skipped
  (matches GetAllRecommendations's per-service tolerance); only an
  error where every (term, payment) combo fails is propagated.

Downstream wiring is already correct:
  - Parser tags rec.PaymentOption = params.PaymentOption (parser_ri.go:42).
  - Scheduler ID encoding includes PaymentOption (scheduler.go:856-858 — PR #189).
  - DB natural-key index includes payment_option (migration 000042).
  - Frontend column-filter already exposes payment-option as a facet.

So the three payment variants land as distinct DB rows and render as
distinct UI rows for free.

Tests:
  - TestGetRecommendationsForService: now expects 6 recs (one per
    {term} × {payment} combo) and asserts the full combo set is
    present.
  - TestGetRecommendationsForService_QueriesBothTerms →
    TestGetRecommendationsForService_QueriesEveryCombo: regression
    expanded to assert all 6 (term, payment) pairs were requested
    against Cost Explorer, so a future change that quietly drops a
    combo fails the test even if the surviving recs parse correctly.

Cost note: at the actual scheduler cadence (daily 2 AM cron + at most
one opportunistic refresh per 6h window when users are active, so
~5 refreshes/day per account), 5 services × 6 combos × 5 refreshes/day
≈ 4,500 Cost Explorer requests/month/account. That's ~3,500 above
the 1,000-request free tier, or ~$35/month/account at $0.01/request.
Worth the user-visible "every term × payment is selectable" win.
Addresses CodeRabbit's actionable finding on PR #195: the
(term, payment) loop in GetRecommendationsForService was treating a
canceled / deadline-exceeded ctx as a per-combo failure to be
tolerated. If earlier combos succeeded before cancellation, the
function would return partial data with a nil error; if all combos
ran after cancellation, it would burn 6× the wasted Cost Explorer
attempts and aggregate them into "all variants failed", masking
the real cancellation reason.

Fix: after each c.GetRecommendations(ctx, params) error, check
ctx.Err() first. If non-nil, return immediately so the caller sees
the ctx error verbatim. Per-combo errors (throttle, 5xx) keep the
existing skip-and-continue tolerance.

Same pattern applied to GetAllRecommendations's per-service block,
which had the identical risk: a canceled ctx during one service's
fan-out would silently skip to the next service rather than
short-circuiting.

Test:
  TestGetRecommendationsForService_ContextCancelShortCircuits —
  pre-cancel a ctx, mock CE to return context.Canceled on every
  call, assert the function (a) returns context.Canceled verbatim
  via errors.Is and (b) issues at most one CE call before bailing
  (rather than marching through all 6 combos).
Check context cancellation before each Cost Explorer recommendation
variant request so a pre-canceled context returns context.Canceled
without touching the API path.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
providers/aws/recommendations/client.go (1)

184-203: Avoid returning partial results on cancellation.

This branch is terminal, so returning allRecommendations alongside ctx.Err() makes the canceled path look partially successful to any future caller. Returning nil, ctx.Err() keeps the contract unambiguous.

Suggested fix
-			if ctx.Err() != nil {
-				return allRecommendations, ctx.Err()
-			}
+			if ctx.Err() != nil {
+				return nil, ctx.Err()
+			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@providers/aws/recommendations/client.go` around lines 184 - 203, The loop
that aggregates recommendations calls GetRecommendationsForService and on
context cancellation returns allRecommendations along with ctx.Err(), which can
mislead callers into thinking partial results are valid; update both return
sites in the services loop (the early-cancel branch inside "if ctx.Err() != nil"
and the select branch after the time.After) to return nil, ctx.Err() instead of
allRecommendations, ctx.Err() so cancellations are unambiguous (change the
returns that reference allRecommendations to return nil).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@providers/aws/recommendations/client.go`:
- Around line 184-203: The loop that aggregates recommendations calls
GetRecommendationsForService and on context cancellation returns
allRecommendations along with ctx.Err(), which can mislead callers into thinking
partial results are valid; update both return sites in the services loop (the
early-cancel branch inside "if ctx.Err() != nil" and the select branch after the
time.After) to return nil, ctx.Err() instead of allRecommendations, ctx.Err() so
cancellations are unambiguous (change the returns that reference
allRecommendations to return nil).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8395007d-9ae0-4311-89f0-d9c19e2fcae2

📥 Commits

Reviewing files that changed from the base of the PR and between 4df421d and fab78c7.

📒 Files selected for processing (2)
  • providers/aws/recommendations/client.go
  • providers/aws/recommendations/client_test.go

Move the recommendations engine-key migration from 000042 to 000043 so it no
longer collides with the retry-linkage migration now present on the PR base.
@cristim cristim force-pushed the fix/issue-188-aws-1yr-recs branch from fab78c7 to 7f7fd02 Compare April 29, 2026 21:24
@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 29, 2026

Addressed the migration-number collision from the updated base branch: rebased onto current feat/multicloud-web-frontend and renumbered 000042_recommendations_add_engine_to_key to 000043_recommendations_add_engine_to_key so it no longer conflicts with 000042_purchase_executions_retry_linkage. Verification: pre-commit migration conflict check passed; go test ./internal/database/postgres/migrations ./internal/config ./internal/scheduler; from providers/aws, go test ./recommendations. @coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 29, 2026

@cristim, understood — renumbering 000042_recommendations_add_engine_to_key000043_recommendations_add_engine_to_key is the right resolution: it keeps the sequential gap-free ordering, removes the collision with 000042_purchase_executions_retry_linkage, and the pre-commit migration conflict check confirms no remaining issues. Triggering a fresh review of the rebased branch now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Return nil with ctx.Err from GetAllRecommendations cancellation paths so
callers cannot accidentally treat partial results as valid.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
providers/aws/recommendations/client.go (1)

117-170: Consider adding runtime metrics for combo fan-out behavior.

Given this now drives up to 6 CE calls per service (and is invoked from scheduler refresh paths), emitting per-combo success/error counters and latency histograms would make throttling and timeout regressions much easier to detect early.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@providers/aws/recommendations/client.go` around lines 117 - 170, Add runtime
metrics around each Cost Explorer combo call inside
GetRecommendationsForService: record a per-combo latency histogram and increment
success/error counters for the specific (term, payment) combo when calling
c.GetRecommendations; include tags/labels for service, term and payment option.
Instrument right before/after the GetRecommendations call (measure start time,
observe duration on return) and increment a success counter on success (where
successCount++ and append occur) and an error counter on each non-context error
(where lastErr is set and continue). Ensure a metric is also incremented if
ctx.Err() causes the function to short-circuit so timeouts/cancels are visible.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@providers/aws/recommendations/client.go`:
- Around line 117-170: Add runtime metrics around each Cost Explorer combo call
inside GetRecommendationsForService: record a per-combo latency histogram and
increment success/error counters for the specific (term, payment) combo when
calling c.GetRecommendations; include tags/labels for service, term and payment
option. Instrument right before/after the GetRecommendations call (measure start
time, observe duration on return) and increment a success counter on success
(where successCount++ and append occur) and an error counter on each non-context
error (where lastErr is set and continue). Ensure a metric is also incremented
if ctx.Err() causes the function to short-circuit so timeouts/cancels are
visible.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 78e4afea-a25a-4ce2-8894-615c6665abaa

📥 Commits

Reviewing files that changed from the base of the PR and between fab78c7 and 7f7fd02.

📒 Files selected for processing (4)
  • internal/database/postgres/migrations/000043_recommendations_add_engine_to_key.down.sql
  • internal/database/postgres/migrations/000043_recommendations_add_engine_to_key.up.sql
  • providers/aws/recommendations/client.go
  • providers/aws/recommendations/client_test.go

@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 29, 2026

Addressed the latest cancellation nitpick: GetAllRecommendations now returns nil, ctx.Err() from both cancellation paths instead of returning partial accumulated results. Verification: from providers/aws, go test ./recommendations; pre-commit Go hooks passed. @coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 29, 2026

@cristim, returning nil, ctx.Err() from both cancellation paths in GetAllRecommendations is the correct final form — discarding partial results on cancellation avoids handing callers a potentially inconsistent slice while signalling the exact cancellation reason. This is fully consistent with Go's context propagation conventions and mirrors the short-circuit already applied to GetRecommendationsForService. Triggering a fresh review now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cristim cristim merged commit dce96ca into feat/multicloud-web-frontend Apr 29, 2026
4 checks passed
cristim added a commit that referenced this pull request Apr 30, 2026
…205) (#206)

* fix(db): idempotently restore retry_execution_id column dropped by migration drift (closes #204, closes #205)

Production /api/purchases/* surfaced 500 errors after PR #161 deployed,
with the underlying CloudWatch error:

    ERROR: column "retry_execution_id" does not exist (SQLSTATE 42703)

User-facing symptoms (dashboard "Upcoming Scheduled Purchases" panel):

  * "Failed to load purchase details: Internal server error"  (#205)
  * "Failed to cancel purchase"                                (#204)

Both buttons hit `Handler.GetExecutionByID` whose SELECT references the
retry_execution_id and retry_attempt_n columns introduced by PR #168.

Root cause: migration version drift.

  * PR #189 added 000042_recommendations_add_engine_to_key earlier on
    the same day. Deployed Lambdas applied it as version 42.
  * PR #168 then added 000042_purchase_executions_retry_linkage,
    colliding on version 42.
  * PR #195 resolved the on-disk duplicate by renaming RECOMMENDATIONS
    to 000043 — the wrong direction. Deployed DBs already had
    schema_migrations.version=42 marked as the engine content, so on
    the next migrate.Up() they:
      - skipped on-disk 000042 (now retry_linkage) entirely (DB
        version >= 42 already)
      - applied on-disk 000043 (now engine again) as a no-op via the
        ADD COLUMN IF NOT EXISTS guards already in that file
    Net effect: the retry_execution_id / retry_attempt_n columns
    were never added on prod, but the Go code expects them.

Forward-only corrective. The new 000044 replays 000042's schema
additions wrapped in IF NOT EXISTS / CREATE INDEX IF NOT EXISTS so it's
a no-op on any DB that already has the columns (fresh deploys,
manually-fixed envs) and additive on the broken ones. We don't try to
repair schema_migrations history — golang-migrate only cares about the
high-water version, and we can't tell from outside whether a given
environment skipped 42 or applied 43 twice.

The migration up + down sql files document the incident and the
edge cases inline so future readers don't have to re-derive the
history from the git log.

Closes #204
Closes #205

* fix(db): repair partially-fixed DBs in 000044, not just no-op (CR pass 1)

CR pass 1 on PR #206 flagged that the original 000044 only added columns
when missing — a manually-fixed DB with the column but no FK, or with
retry_attempt_n nullable, would still be broken after this migration.

Rewrite as five idempotent steps that each check pg_catalog /
information_schema for the precise piece they own, so every reachable
partial state is repaired:

  1. Add retry_execution_id column if missing.
  2. Add the FK on retry_execution_id -> execution_id if no FK on that
     column exists. Match by column relationship (pg_constraint join
     pg_attribute) NOT by constraint name, so a fresh-deploy DB whose
     000042 ran cleanly with the auto-generated
     `purchase_executions_retry_execution_id_fkey` is recognised as
     already-FK'd and we don't add a second.
  3. Add retry_attempt_n column if missing (as nullable, so step 4 can
     backfill before flipping to NOT NULL).
  4. Backfill any NULLs and SET DEFAULT 0 / SET NOT NULL
     unconditionally — Postgres treats those as no-ops when the
     constraint is already as requested.
  5. Create the partial index if missing.

Down stays simple: DROP COLUMN ... CASCADE removes the FK regardless of
its name.

Comments inline document each reachable partial state and why each step
is safe to re-run. Migration is a no-op on healthy DBs, fully repairing
on broken ones.
@cristim cristim deleted the fix/issue-188-aws-1yr-recs branch April 30, 2026 14:00
cristim added a commit that referenced this pull request May 3, 2026
 #224) (#231)

* fix(frontend/recs): enforce one-variant-per-cell radio selection (closes #224)

After PR #195 (issue #188) the recommendations refresh fans out across
2 terms × 3 payments per `(provider, account, service, region,
resource_type, engine)` cell — up to 6 alternative rec rows per
physical resource. They're alternatives, not additions, but the
checkboxes had no mutual-exclusion logic, producing wrong purchase
intent in two ways:

1. **Manual checking**: a user could check `1yr/all-upfront` AND
   `3yr/no-upfront` for the same EC2 m5.large in the same account.
   Both fed into the purchase plan → double commitment for one
   resource.

2. **`select-all` was the destructive form**: every visible row got
   added, so 3 cells × 6 variants = 18 commitments instead of the
   3 the user expected when clicking "buy what's recommended". 6× the
   intended spend per cell.

# What changes

- New `cellKey(rec)` helper returns the
  `(provider, cloud_account_id, service, region, resource_type, engine)`
  prefix (same as the scheduler ID encoding from #189, minus the
  `(term, payment)` suffix). Recs sharing this key are alternatives
  for the same physical resource.

- Per-row checkbox change handler now enforces radio behaviour: on
  check, scan the visible-recs list for any sibling in the same cell
  that's already selected, deselect it FIRST, then add the new rec.
  Cells are independent — selecting in cell X doesn't touch cell Y's
  selection.

- `select-all` handler rewritten: clear current selection, then call
  new `pickBestVariantPerCell(recs)` which groups by cell and picks
  the variant with the highest **effective monthly savings**:
  `effective = savings - (upfront_cost / (term * 12))`. Amortizing
  the upfront over the commitment term means a 3yr/all-upfront with
  a huge lump-sum doesn't beat a 1yr/no-upfront just on raw
  `savings`. Sibling issue #223 will replace this tiebreaker with
  "matches resolved GlobalConfig.DefaultTerm + DefaultPayment" when
  it lands; until then, amortized savings is the right deterministic
  default.

# Out of scope (deliberate)

- **Native `<input type="radio" name="cell-X">` markup**. Per the
  issue: "Designer call — if cell-grouping (sibling issue #226)
  lands first, radios become visually correct." Stayed with
  checkboxes-with-radio-behaviour for this PR; markup switch waits
  for cell-grouping visual.
- **Default-select per cell from settings** (sibling #223).

# Tests

4 new tests in `frontend/src/__tests__/recommendations.test.ts`
inside the `'Issue #224: one-variant-per-cell radio selection'`
describe block:

  (a) Manual toggle within a cell — checking variant B with variant
      A already selected: A is removed, B is added. Sibling A was
      not also added; B was not also removed.
  (b) Cross-cell independence — selecting in cell X must NOT remove
      cell Y's existing selection.
  (c) `select-all` collapses 18 → 3: 3 cells × 6 variants. After
      click, exactly 3 add calls; clearSelectedRecommendations was
      called first to drop stale state.
  (d) Tiebreaker pin — single cell with 3 variants whose
      `(savings, upfront_cost, term)` produce known amortized values
      ($200, $300, $400). The middle variant wins ($400 effective)
      despite the high-upfront variant having $1200 raw savings,
      proving the amortization is actually computed.

`tsc --noEmit` clean. `npx jest --testPathPatterns="recommendations"`
exit 0. `npm run build` (webpack production) exit 0.

* fix(frontend/recs): address CR feedback on PR#231 — term clamp + full-set sibling eviction

- pickBestVariantPerCell: fix term=0 clamp from Math.max(1, r.term*12) to
  Math.max(1, r.term||1)*12 so a malformed zero-term rec is treated as 1yr
  (12 months) not 1 month; keeps the effective-savings winner correct
  (CR actionable: 1-month fallback could skew per-cell winner selection)
- checkbox change handler: iterate state.getRecommendations() (full loaded set)
  instead of filtered `recommendations` when evicting same-cell siblings, so
  variants hidden by a column filter are also deselected, preserving the
  one-variant-per-cell contract across filter changes
  (CR actionable: hidden siblings could remain selected, breaking invariant)
- test: update issue-#224 test (a) and (b) to mock state.getRecommendations();
  add new regression test for the hidden-by-filter scenario

* fix(frontend): address issue 224 review comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working effort/s Hours impact/all-users Affects every user impact/many Affects most users priority/p1 Next up; this sprint severity/high Significant harm triaged Item has been triaged type/bug Defect urgency/this-sprint Within the current sprint

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant