Skip to content

perf(exchange): rec-driven cross-family alternatives via cached Cost Explorer recommendations#79

Merged
cristim merged 12 commits into
feat/multicloud-web-frontendfrom
perf/exchange-rec-driven-alternatives
Apr 28, 2026
Merged

perf(exchange): rec-driven cross-family alternatives via cached Cost Explorer recommendations#79
cristim merged 12 commits into
feat/multicloud-web-frontendfrom
perf/exchange-rec-driven-alternatives

Conversation

@cristim
Copy link
Copy Markdown
Member

@cristim cristim commented Apr 25, 2026

Summary

  • Replaces the hand-curated peerFamilyGroups allowlist + per-recommendation DescribeReservedInstancesOfferings API call with a single region-scoped read against the cached AWS Cost Explorer purchase recommendations table that the scheduler already populates.
  • Drops peerFamilyGroups / candidateFamilies / alternativesForTarget from pkg/exchange/reshape.go; renames OfferingLookup + AnalyzeReshapingWithOfferingsPurchaseRecLookup + AnalyzeReshapingWithRecs. The existing passesDollarUnitsCheck gate stays unchanged.
  • New internal/api/exchange_lookup.go::purchaseRecLookupFromStore adapter calls store.ListStoredRecommendations(...) filtered by provider/service/region (+ optional source CloudAccount UUID for cross-account leak guard). providers/aws/services/ec2/client.go::FindConvertibleOfferings and its dedicated test file deleted (no remaining callers).
  • auto.go keeps using the base AnalyzeReshaping (no alternatives, primary target only) — auto-exchange contract is preserved.

Closes #40.

Test plan

  • go build ./... (all 3 modules)
  • go test -race -count=1 ./... for the main module — all packages green
  • go test -race -count=1 ./pkg/exchange/... — all green
  • go test -race -count=1 ./providers/aws/services/ec2/... ./providers/aws/recommendations/... — green (the only pre-existing failure in providers/aws is TestAWSProvider_GetDefaultRegion and is unrelated to this refactor)
  • New unit tests:
    • pkg/exchange/reshape_crossfamily_test.go: TestAnalyzeReshapingWithRecs_* (recommendation-driven, cold cache, dollar-units filter, lookup error, primary-target exclusion, legacy m4 family).
    • internal/api/exchange_lookup_test.go: region/account filter, empty-account degraded mode, no-recs cold cache, error propagation, UpfrontCost / (Term × 12) + MonthlyCost mapping.
  • Integration test (build-tagged) internal/api/handler_ri_exchange_integration_test.go rewritten to seed recs into the test container's recommendations table instead of mocking FindConvertibleOfferings.
  • Reshape page on the deployed Lambda URL surfaces alternatives sourced from Cost Explorer recs (verified post-deploy).

Summary by CodeRabbit

  • Performance
    • Cross-family alternatives now come from cached purchase recommendations, reducing latency and external API calls.
  • New Features
    • Alternatives include commitment duration (term) and enforce term-matching; same-family and primary targets are excluded and alternatives are sorted by effective monthly cost.
    • Empty region requests now adopt the resolved/configured region for correct scoping.
  • Bug Fixes
    • Handler returns primary target and succeeds when no recommendations exist; account-resolution errors now abort scoped lookups.
  • Refactor
    • AWS identity/config loading made more robust to avoid poisoning shared config.
  • Tests
    • Expanded unit and integration tests for mapping, ordering, term behavior, region/account scoping, and caching.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 25, 2026

Warning

Rate limit exceeded

@cristim has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 2 minutes and 14 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5aabb618-b0eb-4b32-a1dd-6c794386810e

📥 Commits

Reviewing files that changed from the base of the PR and between b877b80 and fa371a4.

📒 Files selected for processing (1)
  • internal/api/handler_test.go

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Replaces per-recommendation EC2 offerings fan-out with a Postgres-backed purchase-recommendation lookup; maps cached Cost Explorer recommendations to exchange.OfferingOption (amortized upfront, normalization, term seconds), threads the lookup into reshape analysis, removes the peer-family allowlist and offerings API, and updates handlers/tests for rec-driven alternatives.

Changes

Cohort / File(s) Summary
New Recommendation Lookup
internal/api/exchange_lookup.go, internal/api/exchange_lookup_test.go
Adds purchaseRecLookupFromStore closure that maps store.ListStoredRecommendations[]exchange.OfferingOption via recommendationToOffering (amortize upfront cost, compute NormalizationFactor, set TermSeconds); adds resolveAWSCloudAccountID and splitInstanceType. Tests cover filter scoping, cost/term math, parsing, and error propagation.
Reshape Analysis Refactor
pkg/exchange/reshape.go, pkg/exchange/reshape_crossfamily_test.go
Removes peerFamilyGroups, candidateFamilies, and OfferingLookup flow; adds PurchaseRecLookup, AnalyzeReshapingWithRecs, and TermSeconds on RIInfo/OfferingOption. Alternatives are now built from cached recs with family/primary/term filters and existing dollar-units gating. Tests adapted for lookup-driven behavior and edge cases.
Handler & Integration Updates
internal/api/handler_ri_exchange.go, internal/api/handler_ri_exchange_integration_test.go, internal/api/handler.go, internal/api/exchange_lookup.go
Handler now resolves tenant-scoped AWS cloud-account UUID (resolveAWSCloudAccountID), defaults region from resolved AWS config when missing, selects page currency from first RI, exposes reshapeAccountResolver for tests, and wires purchaseRecLookupFromStore into AnalyzeReshapingWithRecs. Integration tests seed recs and validate scoped and unscoped lookup behavior.
EC2 Client Cleanup
providers/aws/services/ec2/client.go, providers/aws/services/ec2/client_offerings_test.go
Removes FindConvertibleOfferings and its effectiveMonthlyCost helper; deletes unit tests that validated the removed batched DescribeReservedInstancesOfferings behavior.
Misc Handler Tests
internal/api/handler_test.go
Adds regression test ensuring loadAPIKey does not poison the shared AWS config cache (loads AWS config locally per call).

Sequence Diagram(s)

sequenceDiagram
    participant H as Handler
    participant S as Store/Postgres
    participant L as PurchaseRecLookup
    participant A as AnalyzeReshapingWithRecs

    H->>H: resolveAWSCloudAccountID(ctx)
    H->>S: ListCloudAccounts(provider="aws")
    S-->>H: cloud account UUID or none

    H->>L: build purchaseRecLookupFromStore(region,currency,accountUUID)
    L-->>H: lookup closure

    H->>A: AnalyzeReshapingWithRecs(RIs, utilization, threshold, region, currency, L)
    A->>L: lookup(ctx, region, currency)
    L->>S: ListStoredRecommendations(filter: provider="aws", service="ec2", region, currency, accountIDs?)
    S-->>L: []RecommendationRecord
    L->>L: map records -> OfferingOption (amortize upfront, parse instance, set TermSeconds)
    L-->>A: []OfferingOption

    A->>A: filter offerings (exclude same-family, primary, term mismatch, $-units)
    A-->>H: ReshapeRecommendations with AlternativeTargets
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

Poem

🐰 I hopped through cached recs with glee,

Amortized upfront costs set free,
No fan‑out calls across the sky,
Postgres hummed the wiser buy,
Alternatives now sail merrily.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the main change: replacing hand-curated peer family groups with recommendation-driven alternatives sourced from cached Cost Explorer data.
Linked Issues check ✅ Passed The pull request comprehensively addresses all coding objectives from issue #40: removed peerFamilyGroups/candidateFamilies/alternativesForTarget, implemented PurchaseRecLookup with store.ListStoredRecommendations filtering, replaced AnalyzeReshapingWithOfferings with AnalyzeReshapingWithRecs, updated handler wiring, deleted FindConvertibleOfferings, added per-instance cost normalization (divide by rec.Count), and included comprehensive unit and integration tests validating region/account/currency filtering and failure modes.
Out of Scope Changes check ✅ Passed All changes are properly scoped to the objectives: AWS SDK signature refactoring (resolveAWSAccountID, resolveAWSCallerIdentity) is necessary for fail-closed multi-tenant safety; loadAPIKey fix prevents config cache poisoning in tests; reshapeAccountResolver injection supports hermetic testing. No unrelated features or refactorings were introduced.
Docstring Coverage ✅ Passed Docstring coverage is 89.13% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/exchange-rec-driven-alternatives

Comment @coderabbitai help to get the list of available commands and usage tips.

@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 25, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 25, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 27, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/api/handler_ri_exchange.go (1)

227-257: ⚠️ Potential issue | 🟠 Major

Normalize region before the recommendations lookup.

Line 227 keeps the raw query-string value, so if the client omits region, loadAWSConfigWithRegion can still operate with a default AWS region while AnalyzeReshapingWithRecs receives "". That makes the store read unscoped by region and can surface alternatives from unrelated regions on the reshape page.

Suggested fix
 	region := req.QueryStringParameters["region"]
 	cfg, err := h.loadAWSConfigWithRegion(ctx, region)
 	if err != nil {
 		return nil, fmt.Errorf("failed to load AWS config: %w", err)
 	}
+	if region == "" {
+		region = cfg.Region
+	}
 
 	ec2Client := h.buildReshapeEC2Client(cfg)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/api/handler_ri_exchange.go` around lines 227 - 257, The handler is
using the raw query-string region value which may be empty while
loadAWSConfigWithRegion returns a concrete cfg.Region; normalize the region
after loading cfg (e.g., set resolvedRegion := region; if resolvedRegion == ""
&& cfg.Region != "" then resolvedRegion = cfg.Region) and use resolvedRegion
instead of region when calling getRIUtilizationCache().getOrFetch and
exchange.AnalyzeReshapingWithRecs (and any other downstream region-scoped calls)
so the recommendations lookup and utilization cache are scoped to the actual AWS
region returned by loadAWSConfigWithRegion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/api/exchange_lookup.go`:
- Around line 113-129: The resolver currently hides runtime failures by
returning an empty string; change resolveAWSCloudAccountID(ctx) to return
(string, error) and propagate errors up so callers can fail closed: call
resolveAWSAccountID(ctx) and if empty return ("", nil) to indicate intentionally
unregistered, but if h.config.ListCloudAccounts(...) returns an error then
return ("", err) (and if it returns zero accounts treat as a non-match with nil
error); update callers such as purchaseRecLookupFromStore to check the error
and, on a non-nil error from resolveAWSCloudAccountID, abort/return the error
rather than treating an empty string as “omit AccountIDs”; keep the matching
loop logic (comparing a.ExternalID to awsAccountID) and return the matched a.ID
on success.

In `@pkg/exchange/reshape.go`:
- Around line 67-80: The PurchaseRecLookup contract currently loses
term/duration info so AnalyzeReshapingWithRecs cannot reject term-mismatched
alternatives; change the PurchaseRecLookup signature to return OfferingOption
values that include a term/duration field (e.g., add a Term or Duration
string/int to OfferingOption) and update all callers to populate that field,
then modify AnalyzeReshapingWithRecs to filter returned OfferingOption entries
by matching term/duration before building AlternativeTargets (ensure the term
check is applied prior to amortisation/cost comparisons so AlternativeTargets
never contains incompatible-term alternatives).

---

Outside diff comments:
In `@internal/api/handler_ri_exchange.go`:
- Around line 227-257: The handler is using the raw query-string region value
which may be empty while loadAWSConfigWithRegion returns a concrete cfg.Region;
normalize the region after loading cfg (e.g., set resolvedRegion := region; if
resolvedRegion == "" && cfg.Region != "" then resolvedRegion = cfg.Region) and
use resolvedRegion instead of region when calling
getRIUtilizationCache().getOrFetch and exchange.AnalyzeReshapingWithRecs (and
any other downstream region-scoped calls) so the recommendations lookup and
utilization cache are scoped to the actual AWS region returned by
loadAWSConfigWithRegion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 206c0346-1819-46ea-9651-e5861f2f7a28

📥 Commits

Reviewing files that changed from the base of the PR and between 006b9d3 and 1d76df7.

📒 Files selected for processing (8)
  • internal/api/exchange_lookup.go
  • internal/api/exchange_lookup_test.go
  • internal/api/handler_ri_exchange.go
  • internal/api/handler_ri_exchange_integration_test.go
  • pkg/exchange/reshape.go
  • pkg/exchange/reshape_crossfamily_test.go
  • providers/aws/services/ec2/client.go
  • providers/aws/services/ec2/client_offerings_test.go
💤 Files with no reviewable changes (2)
  • providers/aws/services/ec2/client_offerings_test.go
  • providers/aws/services/ec2/client.go

Comment thread internal/api/exchange_lookup.go Outdated
Comment thread pkg/exchange/reshape.go
cristim added a commit that referenced this pull request Apr 27, 2026
When the reshape endpoint is called without ?region=, loadAWSConfigWithRegion
resolves a default region from the AWS SDK chain but the local `region`
string stays empty. That empty value then propagates into the RI utilization
cache key and AnalyzeReshapingWithRecs lookup, which scopes neither — so
alternatives from unrelated regions could surface on the reshape page.

Adopt cfg.Region after the config load when the query string was empty so
every downstream consumer sees the same region the AWS clients are talking
to. CodeRabbit (PR #79).
cristim added a commit that referenced this pull request Apr 27, 2026
The resolver previously hid ListCloudAccounts errors behind an empty-string
return that the caller treated as "no AWS account registered, omit the
AccountIDs filter". A real DB error therefore caused the rec-driven lookup
to run unscoped and could match recommendations from another tenant.

Promote the function to (string, error):
  - ("", nil) for the intentionally-unregistered cases (no STS resolution,
    no CloudAccount rows, no matching ExternalID) — caller still treats
    this as "skip the filter" so first-run/dev keeps working.
  - ("", err) for genuine ListCloudAccounts failures — handler now aborts
    the reshape request rather than silently broadening scope.

CodeRabbit (PR #79).
cristim added a commit that referenced this pull request Apr 27, 2026
AnalyzeReshapingWithRecs surfaced cross-family alternatives without
checking that the offering's commitment term matched the source RI's,
so a 1y RI could land 3y alternatives in AlternativeTargets (and vice
versa) — wrong because AWS only allows reshape exchanges within the
same term.

Add TermSeconds to RIInfo and OfferingOption (canonical AWS RI duration
unit, matches ec2.ReservedInstance.Duration and the existing
ConvertibleRI.Duration field). The handler plumbs inst.Duration through
to RIInfo; the recs lookup converts rec.Term (years) using 31_536_000
s/year. fillAlternativesFromRecs rejects offerings whose TermSeconds
differs from the source before the dollar-units gate runs.

Both sides falling back to zero TermSeconds skips the term gate so
existing fixtures and any older callers stay green.

Regression tests:
- pkg/exchange/reshape_crossfamily_test.go: 1y source + (1y, 3y)
  alternatives → only the 1y survives; zero-term source still passes
  the legacy mixed-term offering through.
- internal/api/exchange_lookup_test.go: 1y rec → 31_536_000s, 3y rec
  → 3 × 31_536_000s, Term==0 → TermSeconds==0.

CodeRabbit (PR #79).
@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 27, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
internal/api/exchange_lookup.go (1)

141-143: ⚠️ Potential issue | 🔴 Critical

Fail closed when source AWS account resolution fails at runtime.

Line 141 treats unresolved account identity as ("", nil). Combined with Line 48 (AccountIDs omitted), transient STS/config failures become unscoped recommendation reads across accounts in the region.

🔧 Suggested direction
- awsAccountID := h.resolveAWSAccountID(ctx)
- if awsAccountID == "" {
-   return "", nil
- }
+ awsAccountID, err := h.resolveAWSAccountID(ctx)
+ if err != nil {
+   return "", fmt.Errorf("resolve source aws account for reshape scope: %w", err)
+ }
+ if awsAccountID == "" {
+   // keep as nil only for explicitly unsupported/unregistered mode
+   return "", nil
+ }

Also update resolveAWSAccountID to return (string, error) so STS/config failures are distinguishable from intentional “no scope” mode.

Also applies to: 48-50

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/api/exchange_lookup.go` around lines 141 - 143, Change
resolveAWSAccountID to return (string, error) instead of (string, nil) so
runtime STS/config failures are distinguishable from an intentional empty scope;
update the implementation to return a non-nil error when account resolution
fails, and update all call sites in exchange_lookup.go (e.g., the branch that
currently does `if awsAccountID == "" { return "", nil }` and the code around
the `AccountIDs` omission at the top of the file) to propagate the error and
fail closed (return the error up the call chain) rather than treating an empty
string as success/unscoped; ensure function signatures and any tests are updated
accordingly.
🧹 Nitpick comments (1)
pkg/exchange/reshape.go (1)

449-454: Use a fallback normalization factor in pricingGatePasses for defensive compatibility.

Line 453 uses src.NormalizationFactor directly. If a caller provides MonthlyCost but leaves NormalizationFactor unset, all alternatives are rejected even when src.InstanceType is parseable.

♻️ Suggested refactor
 func pricingGatePasses(src RIInfo, off OfferingOption, hasSrc bool) bool {
 	if !hasSrc || src.MonthlyCost <= 0 {
 		return true
 	}
-	return passesDollarUnitsCheck(src.NormalizationFactor, src.MonthlyCost, src.CurrencyCode, off)
+	srcNF := src.NormalizationFactor
+	if srcNF <= 0 {
+		_, size := parseInstanceType(src.InstanceType)
+		srcNF = NormalizationFactorForSize(size)
+	}
+	return passesDollarUnitsCheck(srcNF, src.MonthlyCost, src.CurrencyCode, off)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/exchange/reshape.go` around lines 449 - 454, pricingGatePasses currently
uses src.NormalizationFactor directly which can be zero/unset and wrongly reject
alternatives; modify pricingGatePasses to compute a fallback normalization
factor when src.NormalizationFactor <= 0 by attempting to derive it from
src.InstanceType (e.g. parse vCPU/memory or call an existing parseNormalization
helper) and if that fails default to 1.0, then pass that computed value to
passesDollarUnitsCheck instead of src.NormalizationFactor; update references to
RIInfo fields (NormalizationFactor, MonthlyCost, CurrencyCode, InstanceType) and
the call to passesDollarUnitsCheck in pricingGatePasses.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@internal/api/exchange_lookup.go`:
- Around line 141-143: Change resolveAWSAccountID to return (string, error)
instead of (string, nil) so runtime STS/config failures are distinguishable from
an intentional empty scope; update the implementation to return a non-nil error
when account resolution fails, and update all call sites in exchange_lookup.go
(e.g., the branch that currently does `if awsAccountID == "" { return "", nil }`
and the code around the `AccountIDs` omission at the top of the file) to
propagate the error and fail closed (return the error up the call chain) rather
than treating an empty string as success/unscoped; ensure function signatures
and any tests are updated accordingly.

---

Nitpick comments:
In `@pkg/exchange/reshape.go`:
- Around line 449-454: pricingGatePasses currently uses src.NormalizationFactor
directly which can be zero/unset and wrongly reject alternatives; modify
pricingGatePasses to compute a fallback normalization factor when
src.NormalizationFactor <= 0 by attempting to derive it from src.InstanceType
(e.g. parse vCPU/memory or call an existing parseNormalization helper) and if
that fails default to 1.0, then pass that computed value to
passesDollarUnitsCheck instead of src.NormalizationFactor; update references to
RIInfo fields (NormalizationFactor, MonthlyCost, CurrencyCode, InstanceType) and
the call to passesDollarUnitsCheck in pricingGatePasses.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ca274271-69c6-4985-baa2-78fef6f57ae6

📥 Commits

Reviewing files that changed from the base of the PR and between 1d76df7 and 272a175.

📒 Files selected for processing (5)
  • internal/api/exchange_lookup.go
  • internal/api/exchange_lookup_test.go
  • internal/api/handler_ri_exchange.go
  • pkg/exchange/reshape.go
  • pkg/exchange/reshape_crossfamily_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/api/exchange_lookup_test.go

cristim added a commit that referenced this pull request Apr 27, 2026
…#79)

CodeRabbit review #2 flagged this as CRITICAL: resolveAWSCloudAccountID
still returned ("", nil) when the underlying STS call failed, which
propagated up as "no AccountIDs filter" and could leak another tenants
recommendations on a transient AWS API hiccup.

Fix the underlying primitive AND the wrapper:

- internal/api/handler.go::resolveAWSAccountID — signature changes from
  string to (string, error). Three return shapes now distinguished:
    ("", nil)        — AWS SDK config could not load (Azure/GCP host)
    ("", err)        — config loaded but STS GetCallerIdentity failed
    (accountID, nil) — success
  The non-AWS-host case stays a clean signal; only real STS failures now
  surface as errors.

- internal/api/exchange_lookup.go::resolveAWSCloudAccountID — propagate
  the STS error verbatim. Multi-tenant deployments now fail closed on
  STS hiccups instead of silently issuing an unscoped recommendations
  read.

- internal/api/handler.go::resolveSourceIdentity — keeps the previous
  best-effort semantics (drops the error explicitly with a comment).
  populateSourceAccountID already fails loud on an empty AccountID for
  the AWS-source case, so the security gate is still in place; we just
  preserve backwards compatibility for non-reshape callers.

plus pricingGatePasses NF fallback (CodeRabbit nit on PR #79):

- pkg/exchange/reshape.go::pricingGatePasses — when src.NormalizationFactor
  is zero (callers occasionally leave it unset), derive NF from
  InstanceType via NormalizationFactorForSize. Avoids the silent
  reject-everything failure mode for partial RIInfo constructions.

Tests: full internal/api + pkg/exchange suites green.
@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 27, 2026

Pushed f44b06567 — addresses both outstanding review items plus the nit:

CRITICAL (fail-closed on STS): resolveAWSAccountID signature changes string(string, error). Three distinct return shapes:

  • ("", nil) — AWS SDK config never loaded (Azure/GCP host) — legitimate "no scope" signal.
  • ("", err) — config loaded but STS GetCallerIdentity failed — fail-closed signal for security-sensitive paths.
  • (accountID, nil) — success.

resolveAWSCloudAccountID (the reshape wrapper) now propagates the STS error verbatim. Multi-tenant deploys can no longer leak alternatives across tenants on a transient STS hiccup. resolveSourceIdentity keeps best-effort semantics (drops the error explicitly with a comment) since populateSourceAccountID already fails loud on empty AccountID for the AWS-source federation case.

NIT (pricingGatePasses NF fallback): when src.NormalizationFactor <= 0, derive NF from InstanceType via NormalizationFactorForSize. Avoids silently rejecting every alternative when callers leave NF unset.

Full internal/api + pkg/exchange test suites green. Will file follow-ups for the broader tests-for-failure-paths gap and the other items raised.

@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 27, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 27, 2026

Filed four follow-up issues for items I flagged in the review but that are out of scope for this PR:

None block this PR. The CRITICAL + MAJOR + nit are all addressed in f44b06567.

cristim added 5 commits April 27, 2026 23:56
…Explorer recommendations

The reshape page used to enumerate cross-family alternatives via a
hand-curated `peerFamilyGroups` allowlist plus a per-recommendation
`DescribeReservedInstancesOfferings` API call. The recommendations
table already caches AWS Cost Explorer's `GetReservationPurchaseRecommendation`
output on every scheduler tick — pair underutilised convertibles
against that cache instead of fanning out N×M offering enumerations
on every page load.

What got deleted from `pkg/exchange/reshape.go`:
- `peerFamilyGroups` (hand-curated map; new families were invisible
  until somebody updated the file by hand)
- `candidateFamilies()` and `alternativesForTarget()` helpers
- `OfferingLookup` (per-rec lookup signature)
- `AnalyzeReshapingWithOfferings()` and `distinctCandidateTypes()` /
  `fillAlternativesFromOfferings()`

What replaces them:
- `PurchaseRecLookup func(ctx, region, currencyCode) ([]OfferingOption, error)`
- `AnalyzeReshapingWithRecs(ctx, ris, util, threshold, region, currency, lookup)`:
  ONE region-scoped lookup, then per-rec the offerings are filtered
  to cross-family options that pass the existing
  `passesDollarUnitsCheck` gate (unchanged) and sorted ascending by
  effective monthly cost.

New `internal/api/exchange_lookup.go` adapter
`purchaseRecLookupFromStore(store, accountID)`:
- Calls `store.ListStoredRecommendations(ctx,
  RecommendationFilter{Provider: "aws", Service: "ec2", Region:
  region, AccountIDs: [accountID]})`.
- Maps each `RecommendationRecord` → `OfferingOption` with
  effective monthly = `UpfrontCost / (Term × 12) + MonthlyCost`.
- Filters by source CloudAccount UUID (resolved from STS account
  ID → CloudAccount.ExternalID) so reshape can't surface another
  tenant's recs. Empty UUID = ambient-credentials degraded mode.

`providers/aws/services/ec2/client.go::FindConvertibleOfferings`
deleted along with its dedicated test file (no remaining callers).

`internal/server/handler_ri_exchange.go::convertForAutoExchange`
unchanged — auto.go uses the base `AnalyzeReshaping` (no
alternatives, primary target only) so the auto-exchange pipeline's
contract is preserved.

Tests:
- `pkg/exchange/reshape_crossfamily_test.go` rewritten:
  - Removed `TestCandidateFamilies_*` (function gone).
  - Removed base-path cross-family assertion (alternatives now belong
    exclusively to the WithRecs path).
  - Added `TestAnalyzeReshaping_BaseEmitsNoAlternatives` regression.
  - Migrated all `TestAnalyzeReshapingWithOfferings_*` →
    `TestAnalyzeReshapingWithRecs_*` (recommendation-driven,
    cold-cache, dollar-units filter, lookup error, primary-target
    exclusion, legacy m4 family).
- New `internal/api/exchange_lookup_test.go`: region/account/empty/no-recs
  filter coverage + UpfrontCost amortisation mapping.
- Integration test `internal/api/handler_ri_exchange_integration_test.go`
  switched to seeding recs into the test container's recommendations
  table instead of mocking `FindConvertibleOfferings`.

Closes #40
When the reshape endpoint is called without ?region=, loadAWSConfigWithRegion
resolves a default region from the AWS SDK chain but the local `region`
string stays empty. That empty value then propagates into the RI utilization
cache key and AnalyzeReshapingWithRecs lookup, which scopes neither — so
alternatives from unrelated regions could surface on the reshape page.

Adopt cfg.Region after the config load when the query string was empty so
every downstream consumer sees the same region the AWS clients are talking
to. CodeRabbit (PR #79).
The resolver previously hid ListCloudAccounts errors behind an empty-string
return that the caller treated as "no AWS account registered, omit the
AccountIDs filter". A real DB error therefore caused the rec-driven lookup
to run unscoped and could match recommendations from another tenant.

Promote the function to (string, error):
  - ("", nil) for the intentionally-unregistered cases (no STS resolution,
    no CloudAccount rows, no matching ExternalID) — caller still treats
    this as "skip the filter" so first-run/dev keeps working.
  - ("", err) for genuine ListCloudAccounts failures — handler now aborts
    the reshape request rather than silently broadening scope.

CodeRabbit (PR #79).
AnalyzeReshapingWithRecs surfaced cross-family alternatives without
checking that the offering's commitment term matched the source RI's,
so a 1y RI could land 3y alternatives in AlternativeTargets (and vice
versa) — wrong because AWS only allows reshape exchanges within the
same term.

Add TermSeconds to RIInfo and OfferingOption (canonical AWS RI duration
unit, matches ec2.ReservedInstance.Duration and the existing
ConvertibleRI.Duration field). The handler plumbs inst.Duration through
to RIInfo; the recs lookup converts rec.Term (years) using 31_536_000
s/year. fillAlternativesFromRecs rejects offerings whose TermSeconds
differs from the source before the dollar-units gate runs.

Both sides falling back to zero TermSeconds skips the term gate so
existing fixtures and any older callers stay green.

Regression tests:
- pkg/exchange/reshape_crossfamily_test.go: 1y source + (1y, 3y)
  alternatives → only the 1y survives; zero-term source still passes
  the legacy mixed-term offering through.
- internal/api/exchange_lookup_test.go: 1y rec → 31_536_000s, 3y rec
  → 3 × 31_536_000s, Term==0 → TermSeconds==0.

CodeRabbit (PR #79).
…#79)

CodeRabbit review #2 flagged this as CRITICAL: resolveAWSCloudAccountID
still returned ("", nil) when the underlying STS call failed, which
propagated up as "no AccountIDs filter" and could leak another tenants
recommendations on a transient AWS API hiccup.

Fix the underlying primitive AND the wrapper:

- internal/api/handler.go::resolveAWSAccountID — signature changes from
  string to (string, error). Three return shapes now distinguished:
    ("", nil)        — AWS SDK config could not load (Azure/GCP host)
    ("", err)        — config loaded but STS GetCallerIdentity failed
    (accountID, nil) — success
  The non-AWS-host case stays a clean signal; only real STS failures now
  surface as errors.

- internal/api/exchange_lookup.go::resolveAWSCloudAccountID — propagate
  the STS error verbatim. Multi-tenant deployments now fail closed on
  STS hiccups instead of silently issuing an unscoped recommendations
  read.

- internal/api/handler.go::resolveSourceIdentity — keeps the previous
  best-effort semantics (drops the error explicitly with a comment).
  populateSourceAccountID already fails loud on an empty AccountID for
  the AWS-source case, so the security gate is still in place; we just
  preserve backwards compatibility for non-reshape callers.

plus pricingGatePasses NF fallback (CodeRabbit nit on PR #79):

- pkg/exchange/reshape.go::pricingGatePasses — when src.NormalizationFactor
  is zero (callers occasionally leave it unset), derive NF from
  InstanceType via NormalizationFactorForSize. Avoids the silent
  reject-everything failure mode for partial RIInfo constructions.

Tests: full internal/api + pkg/exchange suites green.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
internal/api/exchange_lookup.go (1)

155-165: ⚠️ Potential issue | 🔴 Critical

Don't treat a resolved-but-unregistered AWS account as “no filter”.

Lines 160-165 still return ("", nil) on “no matching ExternalID”. purchaseRecLookupFromStore interprets that as “omit AccountIDs”, which will read every account's cached recs if other AWS accounts are already registered. The zero-row bootstrap case is fine, but a resolved non-match should fail closed or suppress alternatives.

Suggested fix
 	accounts, err := h.config.ListCloudAccounts(ctx, config.CloudAccountFilter{Provider: &provider})
 	if err != nil {
 		return "", fmt.Errorf("list cloud accounts for reshape scope: %w", err)
 	}
+	if len(accounts) == 0 {
+		return "", nil
+	}
 	for _, a := range accounts {
 		if a.ExternalID == awsAccountID {
 			return a.ID, nil
 		}
 	}
-	return "", nil
+	return "", fmt.Errorf("source aws account %q is not registered", awsAccountID)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/api/exchange_lookup.go` around lines 155 - 165, The code currently
returns ("", nil) when an awsAccountID is resolved but no matching
account.ExternalID is found, which causes purchaseRecLookupFromStore to treat
this as "no filter" and query all accounts; instead, in the function that calls
h.config.ListCloudAccounts (the block using awsAccountID, provider, and
config.CloudAccountFilter), return a non-nil error when a resolved awsAccountID
does not match any account.ExternalID so callers (e.g.,
purchaseRecLookupFromStore) can fail closed or suppress alternatives; update the
return to something like fmt.Errorf("aws account %s not registered",
awsAccountID) and propagate that error to prevent omitting AccountIDs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/api/handler.go`:
- Around line 496-503: The current resolveAWSAccountID swallows all
awsconfig.LoadDefaultConfig errors by returning ("", nil); change it to only
ignore the error if the host is not AWS using sourceCloud(), and otherwise
return the load error so AWS-hosted deployments fail closed. Specifically,
inside Handler.resolveAWSAccountID (the block that sets h.awsCfg and h.awsCfgErr
via h.awsCfgOnce), after checking h.awsCfgErr use sourceCloud() to determine the
runtime cloud: if sourceCloud(ctx) indicates non-AWS return ("", nil) as before,
otherwise return ("", h.awsCfgErr) so callers like resolveAWSCloudAccountID can
handle real AWS config failures; keep existing h.awsCfg and awsCfgOnce usage
unchanged.

---

Duplicate comments:
In `@internal/api/exchange_lookup.go`:
- Around line 155-165: The code currently returns ("", nil) when an awsAccountID
is resolved but no matching account.ExternalID is found, which causes
purchaseRecLookupFromStore to treat this as "no filter" and query all accounts;
instead, in the function that calls h.config.ListCloudAccounts (the block using
awsAccountID, provider, and config.CloudAccountFilter), return a non-nil error
when a resolved awsAccountID does not match any account.ExternalID so callers
(e.g., purchaseRecLookupFromStore) can fail closed or suppress alternatives;
update the return to something like fmt.Errorf("aws account %s not registered",
awsAccountID) and propagate that error to prevent omitting AccountIDs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7c3710fc-ef99-406a-83e3-fcbae4c243f6

📥 Commits

Reviewing files that changed from the base of the PR and between 272a175 and f44b065.

📒 Files selected for processing (3)
  • internal/api/exchange_lookup.go
  • internal/api/handler.go
  • pkg/exchange/reshape.go

Comment thread internal/api/handler.go Outdated
@cristim cristim force-pushed the perf/exchange-rec-driven-alternatives branch from f44b065 to c3428ad Compare April 27, 2026 22:14
@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 27, 2026

Rebased onto current feat/multicloud-web-frontend (PRs #81, #142, #146 merged today). Resolved the conflict in internal/api/handler.go by integrating both changes: resolveAWSCallerIdentity(ctx) (accountID, partition string, err error) — keeps the AWS GovCloud/China partition logic from #146 and adds the fail-closed error propagation from this PR. Tests green on both internal/api and pkg/exchange after rebase. Force-with-lease push only — no commits dropped or reordered.

@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 27, 2026

@coderabbitai review

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/api/handler_ri_exchange_integration_test.go (1)

63-75: ⚠️ Potential issue | 🟠 Major

Don't hard-wire the unscoped recommendations path in the shared helper.

reshapeAccountResolver always returning "" means every integration test skips AccountIDs filtering. A regression in the scoped lookup would still go green here, even though cross-account isolation is one of the behaviors this PR hardens. Please make the helper accept a resolver override and add one scoped-account case that proves only the resolved tenant's recommendations are returned.

Possible helper tweak
-func buildReshapeHandler(store *config.PostgresStore, ec2Fake *fakeReshapeEC2, recsFake *fakeReshapeRecs) *Handler {
+func buildReshapeHandler(
+	store *config.PostgresStore,
+	ec2Fake *fakeReshapeEC2,
+	recsFake *fakeReshapeRecs,
+	resolveAccount func(context.Context) (string, error),
+) *Handler {
+	if resolveAccount == nil {
+		resolveAccount = func(_ context.Context) (string, error) { return "", nil }
+	}
 	h := &Handler{
 		config:             store,
 		auth:               &mockAuthForExchange{},
 		apiKey:             "test-api-key",
 		reshapeEC2Factory:  func(_ aws.Config) reshapeEC2Client { return ec2Fake },
 		reshapeRecsFactory: func(_ aws.Config) reshapeRecsClient { return recsFake },
-		reshapeAccountResolver: func(_ context.Context) (string, error) { return "", nil },
+		reshapeAccountResolver: resolveAccount,
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/api/handler_ri_exchange_integration_test.go` around lines 63 - 75,
The helper buildReshapeHandler always sets reshapeAccountResolver to return ""
causing all tests to exercise the unscoped path; update buildReshapeHandler (the
Handler construction) to accept a reshapeAccountResolver override parameter
(e.g., resolver func(context.Context)(string,error)) and use that instead of the
hard-coded lambda, then add at least one integration test that passes a resolver
returning a specific tenant/cloud-account ID and asserts that reshape
recommendations returned by the Handler are filtered to that account only
(verifying scoped lookup behavior).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/api/handler_ri_exchange_integration_test.go`:
- Around line 119-126: The test's seeded same-family record uses the identical
size as the primary target (m5.large), so it doesn't prove the family-exclusion
logic; update the seed in seedRecsForRegion to use a different size within the
same family (e.g., change the same-family entry from "m5.large" to "m5.2xlarge"
in the RecommendationRecord) and add/assert that this m5.* record is not present
in the cross-family alternatives returned by the handler (repeat the same change
for the other occurrence around lines 149-155) to ensure the same-family filter
is independently validated.

In `@internal/api/handler.go`:
- Around line 552-570: In resolveAWSCallerIdentity, short-circuit non-AWS hosts
before any AWS SDK work: at the top of the function check sourceCloud() != "aws"
and immediately return "", "", nil so you do not call h.awsCfgOnce.Do /
awsconfig.LoadDefaultConfig, create an sts client, or call GetCallerIdentity on
non-AWS deployments; keep existing handling of h.awsCfgErr and the
sts.GetCallerIdentity error path for actual AWS hosts (references:
resolveAWSCallerIdentity, h.awsCfgOnce, h.awsCfgErr,
awsconfig.LoadDefaultConfig, sts.NewFromConfig, GetCallerIdentity, sourceCloud).

---

Outside diff comments:
In `@internal/api/handler_ri_exchange_integration_test.go`:
- Around line 63-75: The helper buildReshapeHandler always sets
reshapeAccountResolver to return "" causing all tests to exercise the unscoped
path; update buildReshapeHandler (the Handler construction) to accept a
reshapeAccountResolver override parameter (e.g., resolver
func(context.Context)(string,error)) and use that instead of the hard-coded
lambda, then add at least one integration test that passes a resolver returning
a specific tenant/cloud-account ID and asserts that reshape recommendations
returned by the Handler are filtered to that account only (verifying scoped
lookup behavior).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9af3d361-3262-4c5f-8f09-787d9eb2d9f6

📥 Commits

Reviewing files that changed from the base of the PR and between 1593007 and 62667c8.

📒 Files selected for processing (4)
  • internal/api/exchange_lookup.go
  • internal/api/handler.go
  • internal/api/handler_ri_exchange.go
  • internal/api/handler_ri_exchange_integration_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/api/handler_ri_exchange.go
  • internal/api/exchange_lookup.go

Comment thread internal/api/handler_ri_exchange_integration_test.go Outdated
Comment thread internal/api/handler.go
cristim added 2 commits April 28, 2026 01:14
CodeRabbit pass-5 follow-up: 62667c8 fixed loadAPIKey so it stops
sharing h.awsCfgOnce/h.awsCfgErr with the request-path identity
resolver, but the matching regression test was deferred. Adding it
here.

TestLoadAPIKey_DoesNotPoisonSharedAWSConfigCache asserts logical
isolation rather than trying to provoke a config-load failure
(LoadDefaultConfig itself rarely errs — credential failures surface
later at the SDK call). Two halves per subtest:

  1. h.awsCfgErr stays nil regardless of loadAPIKey's outcome — the
     shared error field must not be written at all.
  2. h.awsCfgOnce is still virgin after loadAPIKey returns. Verified
     with a sentinel sync.Once.Do trick: Do only runs the func if it
     has not yet been called, so if loadAPIKey had touched the shared
     Once, the sentinel would no-op and the assertion would fail.

Two table-style subtests cover both branches of the loadAPIKey path
(empty ARN short-circuit, and non-empty ARN that goes through
awsconfig.LoadDefaultConfig). AWS env vars are pinned via t.Setenv so
the SDK's default config resolution stays deterministic in CI without
hitting ambient credentials.

If a future refactor reintroduces h.awsCfgOnce.Do inside loadAPIKey,
the second subtest fails on the `fired` assertion — preventing the
multi-tenant reshape scope filter from being permanently broken by a
transient cold-start AWS config failure.
CodeRabbit pass-5 finding 2: the existing reshape integration tests
all wire reshapeAccountResolver to ("", nil), so only the unscoped
"no AccountIDs filter" path had coverage. The scoped branch — which
is the multi-tenant cross-account leak guard — was untested. A
regression that silently dropped the AccountIDs filter (e.g. a
future refactor reintroducing the resolved-but-unregistered fall-
through bug) would have shipped undetected.

This commit adds TestReshapeRecommendations_Integration_ScopedAccount_FiltersToAccount:

  1. Seeds two CloudAccounts (Tenant A + Tenant B) in the test
     Postgres container so the cloud_account_id FK on the
     recommendations table is satisfied for both rows.
  2. Seeds two cross-family recs in the same region — A's c5.large
     and B's r5.large — each tagged with its account's UUID.
  3. Overrides h.reshapeAccountResolver to return Tenant A's UUID
     (the production equivalent of resolveAWSCloudAccountID
     returning a registered match).
  4. Asserts the response surfaces ONLY c5.large in alternatives,
     never r5.large — proving the AccountIDs predicate landed on
     the SQL query rather than being dropped.

Also tightens the EndToEnd test's same-family seed: was m5.large
(same SIZE as the primary target, which trivial exact-match dedupe
would also strip), now m5.2xlarge (different size, exercising the
FAMILY-exclusion contract that's actually under test).

A new seedCloudAccount helper keeps the new test readable and
reusable for any future scoped-branch coverage. Existing tests are
untouched beyond the same-family seed clarification.

Tests: go test ./... green at root + pkg/. The new integration
test follows the same setupReshapeHandlerIntegration / docker-
based testcontainers pattern as its siblings, so it runs under
the same CI gate.
@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 27, 2026

CodeRabbit pass-5 follow-up — both findings addressed.

Finding 1 (9d1cba173): Added TestLoadAPIKey_DoesNotPoisonSharedAWSConfigCache to lock in commit 62667c8d2's fix. The test asserts logical isolation rather than provoking a LoadDefaultConfig failure (which rarely errs — credential failures surface later at the SDK call). Two halves per subtest: h.awsCfgErr stays nil regardless of loadAPIKey's outcome, and h.awsCfgOnce is still virgin afterwards (verified with a sentinel sync.Once.Do trick — Do only runs the func if it hasn't been called, so if loadAPIKey had touched the shared Once, the sentinel would no-op). Two table-style subtests cover both code paths (empty-ARN short-circuit, and non-empty ARN that goes through awsconfig.LoadDefaultConfig); AWS env vars are pinned via t.Setenv to keep CI deterministic without ambient credentials. If a future refactor reintroduces h.awsCfgOnce.Do inside loadAPIKey, the second subtest fails on the fired assertion — preventing the multi-tenant reshape scope filter from being permanently broken by a transient cold-start AWS config failure.

Finding 2 (e775264db): Added TestReshapeRecommendations_Integration_ScopedAccount_FiltersToAccount. The existing reshape integration tests all wired reshapeAccountResolver to ("", nil), so only the unscoped "no AccountIDs filter" path had coverage. The scoped branch — the multi-tenant cross-account leak guard — was untested. The new test seeds two CloudAccounts (Tenant A + Tenant B) and one cross-family rec per tenant in the same region (A's c5.large, B's r5.large, each tagged with its account UUID), overrides h.reshapeAccountResolver to return Tenant A's UUID (production-equivalent of resolveAWSCloudAccountID returning a registered match), and asserts the response surfaces only c5.large in alternatives — never r5.large. Also tightens the existing EndToEnd test's same-family seed: was m5.large (same SIZE as the primary target, which trivial exact-match dedupe would also strip), now m5.2xlarge (different size, exercising the FAMILY-exclusion contract that's actually under test) with an independent regression-guard loop that asserts no m5.* of any size leaks through.

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

@cristim: Both pass-5 findings look well addressed:

  • Finding 1 (9d1cba173): The sync.Once sentinel approach is a clean way to assert that loadAPIKey never touches h.awsCfgOnce — it will catch any future regression that reintroduces shared-cache writes without requiring a LoadDefaultConfig failure to be induced artificially. Pinning t.Setenv for AWS env vars is good practice for CI determinism.

  • Finding 2 (e775264db): Seeding two tenants with distinct account-scoped recs and asserting strict per-tenant isolation closes the previously untested multi-tenant leak path. The EndToEnd seed tightening from m5.largem5.2xlarge is a meaningful improvement — it now exercises family-exclusion rather than relying on exact-match deduplication as a side effect.

Triggering review of the new commits now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

CodeRabbit pass-6 inline finding: resolveAWSCallerIdentity used to
call LoadDefaultConfig + STS regardless of host, then check
sourceCloud() only to decide whether the failure was real. That
wasted a config-load attempt on every reshape request from an
Azure/GCP-hosted Lambda — pure overhead since the function will
return ("", "", nil) anyway.

Move the sourceCloud() check to the top of the function. Non-AWS
hosts return immediately, no AWS SDK work attempted. AWS hosts with
a broken SDK config continue to fail closed with the wrapped error
so the multi-tenant scope filter in resolveAWSCloudAccountID
doesn't degrade into an unscoped read.

Comment block updated to make the short-circuit policy explicit.

Tests: full internal/api unit suite green; build clean.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal/api/handler_test.go (1)

91-137: Make this regression test hermetic.

The test still exercises the real AWS credential/provider chain and Secrets Manager client, so it can vary with ambient profiles or IMDS and become flaky on some CI hosts. Consider disabling metadata lookup explicitly or injecting a stubbed AWS seam so the test only verifies the shared-cache contract.

Suggested tweak
 t.Setenv("AWS_REGION", "us-east-1")
+t.Setenv("AWS_EC2_METADATA_DISABLED", "true")
 t.Setenv("AWS_ACCESS_KEY_ID", "")
 t.Setenv("AWS_SECRET_ACCESS_KEY", "")
 t.Setenv("AWS_SESSION_TOKEN", "")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/api/handler_test.go` around lines 91 - 137, The test is still
hitting the real AWS provider chain; make it hermetic by disabling metadata
lookups and/or injecting a fake Secrets Manager: in the test set
AWS_EC2_METADATA_DISABLED="true" (alongside the existing AWS_* env vars) to stop
IMDS lookups, and refactor the Handler to accept an injectable secrets-fetch
seam (e.g., a SecretsClient interface or a loadSecretFunc) used by loadAPIKey so
tests can pass a stub that simulates failures without calling AWS; update the
test to construct Handler with the stubbed client when invoking
Handler.loadAPIKey(secretsARN) to verify the shared-cache behavior
deterministically.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@internal/api/handler_test.go`:
- Around line 91-137: The test is still hitting the real AWS provider chain;
make it hermetic by disabling metadata lookups and/or injecting a fake Secrets
Manager: in the test set AWS_EC2_METADATA_DISABLED="true" (alongside the
existing AWS_* env vars) to stop IMDS lookups, and refactor the Handler to
accept an injectable secrets-fetch seam (e.g., a SecretsClient interface or a
loadSecretFunc) used by loadAPIKey so tests can pass a stub that simulates
failures without calling AWS; update the test to construct Handler with the
stubbed client when invoking Handler.loadAPIKey(secretsARN) to verify the
shared-cache behavior deterministically.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e0c3896a-e413-41d9-ac4a-8fc863289ada

📥 Commits

Reviewing files that changed from the base of the PR and between 62667c8 and e775264.

📒 Files selected for processing (2)
  • internal/api/handler_ri_exchange_integration_test.go
  • internal/api/handler_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/api/handler_ri_exchange_integration_test.go

@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 27, 2026

Pushed b877b80cf — addresses CodeRabbit pass-6 inline finding 1 (resolveAWSCallerIdentity short-circuit). Non-AWS hosts now skip LoadDefaultConfig + STS entirely; AWS hosts continue to fail closed on broken SDK config.

Combined with the concurrent session's e775264db (scoped-account integration test + same-family m5.2xlarge seed) and 9d1cba173 (loadAPIKey regression test), all CodeRabbit findings from passes 5 and 6 should now be addressed.

The buildReshapeHandler signature refactor CodeRabbit suggested isn't applied — e775264db instead added the scoped-branch test by overriding h.reshapeAccountResolver directly post-construction, which closes the underlying coverage gap (the MAJOR's actual ask) without the helper-arg shape change.

@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 27, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

CodeRabbit pass-7 nitpick: TestLoadAPIKey_DoesNotPoisonSharedAWSConfigCache
cleared the credential env vars but left IMDS reachable. On CI
runners or dev EC2 hosts where the metadata service is reachable,
the SDK credential chain probes IMDS and waits the full retry
window before falling through to "no credentials" — making the
test slow and host-dependent.

Add t.Setenv("AWS_EC2_METADATA_DISABLED", "true") so the chain
short-circuits past IMDS and the test outcome depends only on
the static env-var configuration we control.

Test still passes; behaviour assertion (the shared cache stays
untouched) is unchanged — this just makes the path deterministic.
@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 28, 2026

Pushed fa371a42c — applies the CodeRabbit nitpick by setting AWS_EC2_METADATA_DISABLED=true in the loadAPIKey regression test. Skipped the deeper SecretsClient injection seam suggestion — that's a wider refactor than the regression test needs, and IMDS-disable plus the existing zero-credential env vars is enough to make the test deterministic.

All CodeRabbit findings from passes 1–7 should now be addressed.

@cristim
Copy link
Copy Markdown
Member Author

cristim commented Apr 28, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 28, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cristim cristim merged commit 62b8277 into feat/multicloud-web-frontend Apr 28, 2026
3 checks passed
@cristim cristim added triaged Item has been triaged priority/p3 Polish / idea / may never ship severity/low Minor harm urgency/this-quarter Within the quarter impact/few Limited audience effort/xl Multi-week / refactor type/feat New capability labels Apr 28, 2026
@cristim cristim deleted the perf/exchange-rec-driven-alternatives branch April 30, 2026 14:00
cristim added a commit that referenced this pull request May 3, 2026
… empty-region normalization (closes #151) (#230)

Adds two regression-guard tests covering the security paths PR #79 +
follow-up commits f44b065 and afc3aa1 hardened. Both fixes landed
without coverage; a future refactor could silently break either gate.

Test 1 — TestResolveAWSCloudAccountID_FailsClosedOnSTSError
   Pins commit f44b065's multi-tenant safety contract: when STS
   GetCallerIdentity fails (transient AWS error, denied perm, expired
   token), resolveAWSCloudAccountID propagates the error rather than
   returning ("", nil). The empty-with-no-error path would have
   purchaseRecLookupFromStore skip the AccountIDs filter and surface
   another tenant's recs — exactly the leak the production fix
   prevents.

   Mock injection: a failing http.RoundTripper installed on the
   handler's pre-sealed aws.Config makes STS GetCallerIdentity fail
   without hitting the network. Static dummy credentials + an explicit
   region let the SDK's signer compose the request before the failing
   transport runs (otherwise the SDK would error with MissingRegion /
   NoCredentialProviders before we reached the STS branch). The test
   asserts the wrapped error AND that ListCloudAccounts was never
   invoked — proving the resolver short-circuited before the DB read.

Test 2 — TestGetReshapeRecommendations_EmptyRegionUsesConfigRegion
   Pins commit afc3aa1's region-normalization fix: when ?region=
   is empty (or omitted), the handler adopts cfg.Region (the value
   the AWS clients are actually talking to) before threading region
   through to the recs lookup. Without normalization the recs query
   lands as Region:"" — an unscoped read that surfaces alternatives
   from every region onto the reshape page.

   Mock injection mirrors the integration-test helpers but lives in
   the unit-test file (no //go:build integration tag). MockConfigStore
   captures every ListStoredRecommendations filter via mock.Run so
   the assertion can pin filter.Region == cfg.Region. A guard
   require.NotEmpty(capturedFilters) prevents a silent-pass failure
   mode if the analyzer ever stops invoking the closure for synthetic
   RIs.

Would-fail-if-reverted verification (manual, before commit):
- Commenting out the `if region == "" { region = cfg.Region }` block
  in handler_ri_exchange.go made Test 2 fail with the captured filter
  showing Region:"" instead of "us-west-2".
- Changing the STS-error branch in exchange_lookup.go from
  `return "", fmt.Errorf(...)` back to `return "", nil` made Test 1
  fail with "An error is expected but got nil".

No production code modified — additive tests only.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

effort/xl Multi-week / refactor impact/few Limited audience priority/p3 Polish / idea / may never ship severity/low Minor harm triaged Item has been triaged type/feat New capability urgency/this-quarter Within the quarter

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant