refactor: Migrate vpc handlers to WithTx by chet · Pull Request #472 · NVIDIA/infra-controller-rest

chet · 2026-05-02T05:40:22Z

Description

This applies the new WithTx pattern from #462 to the vpc handlers (Create/Update/Delete plus the virtualization-type update).

The UpdateVirtualization handler uses the timeoutResp closure pattern for common.TerminateWorkflowOnTimeOut so the helper still runs after the tx unwinds; the other three retain their inline stc.TerminateWorkflow paths.

Keeping these PRs smaller and tightly scoped so they're:

- In theory a little easier to read.
- More tightly scoped/less "blast radius" per PR.
- A little nicer on/for @coderabbitai. 😆

I do wish the diffs were easier to read, but it is what it is!

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

Type of Change

Feature - New feature or functionality (feat:)
Fix - Bug fixes (fix:)
Chore - Modification or removal of existing functionality (chore:)
Refactor - Refactoring of existing functionality (refactor:)
Docs - Changes in documentation or OpenAPI schema (docs:)
CI - Changes in GitHub workflows. Requires additional scrutiny (ci:)
Version - Issuing a new release version (version:)

Services Affected

API - API models or endpoints updated
Workflow - Workflow service updated
DB - DB DAOs or migrations updated
Site Manager - Site Manager updated
Cert Manager - Cert Manager updated
Site Agent - Site Agent updated
RLA - RLA service updated
Powershelf Manager - Powershelf Manager updated
NVSwitch Manager - NVSwitch Manager updated

Related Issues (Optional)

Breaking Changes

This PR contains breaking changes

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

Additional Notes

coderabbitai · 2026-05-02T05:40:33Z

Caution

Review failed

Failed to post review comments

Summary by CodeRabbit

Refactor
- Improved VPC operation internals for more reliable create, update, virtualization update, and delete flows, ensuring consistent status transitions, synchronous orchestration, and best-effort post-operation updates.
Tests
- Updated test expectation text to reflect revised timeout/error messaging for VPC update workflows.

Walkthrough

Refactors VPC handlers to run DB updates and synchronous Temporal workflow invocations inside cdb.WithTx closures. Handlers initialize needed request/state outside transactions, defer workflow-timeout termination via closures, and perform post-transaction best-effort reconciliation (e.g., Active VNI, Ready status) where applicable.
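The walkthrough relies on `cdb.WithTx` owning the transaction lifecycle, but the helper itself is not shown in this thread. As a rough illustration only, a helper of this kind usually commits when the callback returns nil and rolls back otherwise. In the sketch below, `txn`, `fakeTx`, `withTx`, and `errWorkflow` are hypothetical stand-ins so the example runs without a database; they are not the repository's `cdb` API, and panic handling is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// errWorkflow is a hypothetical error a callback might return.
var errWorkflow = errors.New("workflow failed")

// txn is a hypothetical stand-in for a database transaction handle.
type txn interface {
	Commit() error
	Rollback() error
}

// fakeTx records which lifecycle call was made so the sketch needs no database.
type fakeTx struct {
	committed  bool
	rolledBack bool
}

func (t *fakeTx) Commit() error   { t.committed = true; return nil }
func (t *fakeTx) Rollback() error { t.rolledBack = true; return nil }

// withTx runs fn against tx. It commits if fn returns nil and rolls back
// otherwise, returning fn's error (joined with any rollback error) so the
// caller can translate it into an API error afterwards.
func withTx(tx txn, fn func(tx txn) error) error {
	if err := fn(tx); err != nil {
		if rbErr := tx.Rollback(); rbErr != nil {
			return errors.Join(err, rbErr)
		}
		return err
	}
	return tx.Commit()
}

func main() {
	ok := &fakeTx{}
	err := withTx(ok, func(_ txn) error { return nil })
	fmt.Println(ok.committed, ok.rolledBack, err)

	bad := &fakeTx{}
	err = withTx(bad, func(_ txn) error { return errWorkflow })
	fmt.Println(bad.committed, bad.rolledBack, err)
}
```

The point of the shape is that the handler never calls Begin, Commit, or Rollback itself; returning an error from the closure is the only way to abort.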

Changes

VPC Handler Transaction Refactoring

Layer / File(s)	Summary
Import / Setup `api/pkg/api/handler/vpc.go`	Removed manual transaction Begin/Commit/Rollback usage; transaction control delegated to `cdb.WithTx`. Added `timeoutResp` termination closures and workflow result placeholders declared outside transactional closures.
State Init / Request Build `api/pkg/api/handler/vpc.go`	Controller/workflow request structs (e.g., `controllerVpc`, `uv`), status-detail DAO (`sdDAO`), and status-detail slices (`ssds`) are prepared outside the transaction to enable deferred actions and post-transaction reconciliation.
Create VPC: DB + Workflow `api/pkg/api/handler/vpc.go` (CreateVPC handler)	Inside `cdb.WithTx`: create VPC record, update controller ID, create provisioning status-detail, obtain Site Temporal client, and execute synchronous `CreateVPCV2` workflow. After the transaction, best-effort updates set `ActiveVni` and add Ready status-detail if workflow succeeded; `timeoutResp` handles workflow timeouts.
Update VPC: DB + Workflow `api/pkg/api/handler/vpc.go` (UpdateVPC handler)	Inside `cdb.WithTx`: apply VPC field updates, optionally clear NSG/NVLink propagation state, fetch status details, obtain Site Temporal client, and execute synchronous `UpdateVPC` workflow. Post-transaction maps DB model to API model and defers timeout termination via `timeoutResp`.
Update VPC Virtualization: DB + Workflow `api/pkg/api/handler/vpc.go` (UpdateVPCVirtualization handler)	Inside `cdb.WithTx`: update VPC virtualization type, fetch status history, obtain Site Temporal client, and execute synchronous `UpdateVPCVirtualization` workflow. Timeout termination is deferred through `timeoutResp`; workflow results are mapped to API errors afterward.
Delete VPC: DB + Workflow `api/pkg/api/handler/vpc.go` (DeleteVPC handler)	Inside `cdb.WithTx`: set VPC status to `Deleting`, create a best-effort deletion status-detail, obtain Site Temporal client, and execute synchronous `DeleteVPCV2` workflow. NotFound from workflow is treated as skippable. Handler returns `202 Accepted` after transaction completes; timeout handling deferred.
Tests `api/pkg/api/handler/vpc_test.go`	Updated expected error message for the VPC update workflow timeout test to match revised timeout text.
Post-Transaction Handling / Errors `api/pkg/api/handler/vpc.go`	After `WithTx`, handlers invoke deferred `timeoutResp()` when set, perform best-effort state reconciliation, and translate transaction/workflow errors into API errors (via existing utilities like `common.HandleTxError`).
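To make "best-effort" in the table concrete: a post-transaction update of this kind typically records a failure without failing the request, because the transaction has already committed. The following is a minimal sketch under that assumption. The `vpc` struct, `store`, `setActiveVni`, and `reconcileAfterCommit` are hypothetical names for illustration; the real handler's DAO calls and logging are not shown in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// vpc is a hypothetical, trimmed-down model used only for this sketch.
type vpc struct {
	ID        string
	ActiveVni *int
}

// store is a hypothetical DAO; fail simulates a database error.
type store struct{ fail bool }

func (s *store) setActiveVni(v *vpc, vni int) error {
	if s.fail {
		return errors.New("db unavailable")
	}
	v.ActiveVni = &vni
	return nil
}

// reconcileAfterCommit runs after the transaction has committed. A failure is
// collected as a warning instead of being returned, so the caller can still
// report success for the already-committed work.
func reconcileAfterCommit(s *store, v *vpc, vni int) []string {
	var warnings []string
	if err := s.setActiveVni(v, vni); err != nil {
		warnings = append(warnings, fmt.Sprintf("best-effort ActiveVni update failed for %s: %v", v.ID, err))
	}
	return warnings
}

func main() {
	v := &vpc{ID: "vpc-1"}
	fmt.Println(reconcileAfterCommit(&store{fail: true}, v, 42), v.ActiveVni == nil)
}
```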

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'refactor: Migrate vpc handlers to WithTx' directly and clearly summarizes the main change—migrating VPC handlers to use the WithTx transaction pattern.
Description check	✅ Passed	The description is directly related to the changeset, explaining which handlers are being refactored to use WithTx, the pattern variations employed, and referencing the originating issue `#462`.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


github-actions · 2026-05-02T05:41:35Z

🔐 TruffleHog Secret Scan

✅ No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🕐 Last updated: 2026-05-02 05:41:34 UTC | Commit: e1bbf86

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/pkg/api/handler/vpc.go`:
- Around line 449-466: The timeout handling currently calls
stc.TerminateWorkflow from inside the cdb.WithTx closure (checking
errors.As(wferr, &timeoutErr) || wferr == context.DeadlineExceeded ||
wfCtx.Err() != nil), holding the DB transaction during the RPC; instead, set a
local flag (e.g., timeoutResp or a boolean like needTerminate) and capture the
workflow ID (wid) and wferr inside the closure, then return from WithTx; after
WithTx completes, if needTerminate is true create a new context with
context.WithTimeout using cutil.WorkflowContextNewAfterTimeout and call
stc.TerminateWorkflow (handling serr/logging and returning the appropriate
cutil.NewAPIError), mirroring the pattern used by UpdateVirtualization to avoid
making remote calls while the transaction is open; reference
stc.TerminateWorkflow, timeoutErr, wferr, wfCtx, wid,
cutil.WorkflowContextNewAfterTimeout and logger to locate and implement the
change.
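The three-way timeout check quoted in this comment can be isolated into a small predicate. The sketch below is an illustration, not the handler's code: `timeoutError` is a local stand-in for the SDK type the handler calls `tp.TimeoutError`, and `isWorkflowTimeout` and `examples` are hypothetical names. One deliberate difference from the quoted condition is the use of `errors.Is` instead of `==` for `context.DeadlineExceeded`, so that wrapped deadline errors are matched as well.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// timeoutError is a local stand-in for the SDK timeout type; it exists only
// so this sketch compiles on its own.
type timeoutError struct{ msg string }

func (e *timeoutError) Error() string { return e.msg }

// isWorkflowTimeout reports whether a workflow call should be treated as
// timed out: a timeout-typed error anywhere in the chain, a deadline error,
// or a workflow context that has already ended.
func isWorkflowTimeout(wfCtx context.Context, wferr error) bool {
	var te *timeoutError
	return errors.As(wferr, &te) ||
		errors.Is(wferr, context.DeadlineExceeded) ||
		wfCtx.Err() != nil
}

// examples evaluates the predicate for a wrapped timeout error, an unrelated
// error on a live context, and an unrelated error on an expired context.
func examples() (wrapped, plain, expired bool) {
	ctx := context.Background()
	wrapped = isWorkflowTimeout(ctx, fmt.Errorf("execute workflow: %w", &timeoutError{msg: "start-to-close"}))
	plain = isWorkflowTimeout(ctx, errors.New("validation failed"))

	expiredCtx, cancel := context.WithTimeout(ctx, time.Nanosecond)
	defer cancel()
	<-expiredCtx.Done() // wait until the deadline has passed
	expired = isWorkflowTimeout(expiredCtx, errors.New("validation failed"))
	return wrapped, plain, expired
}

func main() {
	fmt.Println(examples())
}
```

Note that the third clause fires on any ended context, including an explicit cancel, so a handler using this check treats cancellation the same way as a timeout.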


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f7a330e7-4070-4834-be73-35fa422f9384

📥 Commits

Reviewing files that changed from the base of the PR and between bce7503 and e1bbf86.

📒 Files selected for processing (1)

api/pkg/api/handler/vpc.go

github-actions · 2026-05-02T05:53:18Z

🔍 Container Scan Summary

Service	Total	Critical	High	Medium	Low	Other
nico-nsm	64	2	20	33	9	0
nico-psm	56	4	29	13	2	8
nico-rest-api	57	4	30	13	2	8
nico-rest-cert-manager	54	4	28	13	1	8
nico-rest-db	55	4	28	13	2	8
nico-rest-site-agent	54	4	28	13	1	8
nico-rest-site-manager	54	4	28	13	1	8
nico-rest-workflow	56	4	29	13	2	8
nico-rla	55	4	28	13	2	8
TOTAL	505	34	248	137	22	64

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

hwadekar-nv · 2026-05-04T17:24:47Z

+			var timeoutErr *tp.TimeoutError
+			if errors.As(wferr, &timeoutErr) || wferr == context.DeadlineExceeded || wfCtx.Err() != nil {
+				logger.Error().Err(wferr).Msg("failed to create VPC, timeout occurred executing workflow on Site.")
+				timeoutResp = func() error {


I’m trying to understand how this cancel workflow will execute based on this pattern.

WithTx returns an APIError, which gets processed after WithTx executes. Otherwise, HandleTxError manages returning the error.

@hwadekar-nv Yeahhh good question! The timeoutResp flow is:

Inside the closure, when we detect a timeout, we don't call TerminateWorkflow directly. Instead, we stash a closure in the outer-scope timeoutResp that knows how to do the termination work.

We then return a *cutil.APIError from the WithTx closure, which causes WithTx to roll back the transaction.

After WithTx returns (rollback complete, locks released, etc.), the outer code checks if timeoutResp is set; that's when the actual TerminateWorkflow RPC fires (calling TerminateWorkflowOnTimeOut, which builds a fresh context with WorkflowContextNewAfterTimeout, hits Site to terminate the orphaned workflow, and returns an error to the caller).

And if it's not obvious at first glance, the reason we don't inline TerminateWorkflow while still inside the closure is that the DB transaction would stay open during that second RPC, which would continue to hold locks + a connection while we wait on the network/RPC.

We ran into this a bunch in the Core codebase, and have spent almost a year trying to clean it up and unwind it (we even have a custom compiler extension now that looks for blocking calls during transactions).

Lemme know if that all makes sense!
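The ordering described in this reply can be checked with a small runnable sketch. It is a simplified illustration under stated assumptions: `withTx` here is a stand-in that only records lifecycle events, `handle` is a hypothetical handler body, and `errTimeout` is a hypothetical sentinel for a detected workflow timeout. None of these are the repository's actual `cdb` or `common` APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// errTimeout is a hypothetical sentinel for a detected workflow timeout.
var errTimeout = errors.New("workflow timed out")

// withTx is a stand-in transaction helper that records begin, rollback, and
// commit events instead of talking to a database.
func withTx(events *[]string, fn func() error) error {
	*events = append(*events, "begin")
	if err := fn(); err != nil {
		*events = append(*events, "rollback")
		return err
	}
	*events = append(*events, "commit")
	return nil
}

// handle sketches a handler body that defers workflow termination until the
// transaction has finished.
func handle(events *[]string, timedOut bool) error {
	var timeoutResp func() error

	err := withTx(events, func() error {
		*events = append(*events, "db write")
		if timedOut {
			// Capture the cleanup, but do not run it while the transaction is open.
			timeoutResp = func() error {
				*events = append(*events, "terminate workflow")
				return errTimeout
			}
			return errTimeout // a non-nil error makes withTx roll back
		}
		return nil
	})

	// The transaction is finished here, so blocking network work no longer
	// holds locks or a DB connection.
	if timeoutResp != nil {
		return timeoutResp()
	}
	return err
}

func main() {
	var events []string
	err := handle(&events, true)
	fmt.Println(events, err)
}
```

On the timeout path the recorded events are begin, db write, rollback, terminate workflow, in that order, which is the property the reply is describing: the terminate call happens only after the rollback.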

thossain-nv

Thank you for the great refactor, @chet! Added a suggestion for timeout handling.

thossain-nv · 2026-05-05T19:06:10Z

+			var timeoutErr *tp.TimeoutError
+			if errors.As(wferr, &timeoutErr) || wferr == context.DeadlineExceeded || wfCtx.Err() != nil {
+				logger.Error().Err(wferr).Msg("failed to create VPC, timeout occurred executing workflow on Site.")
+				timeoutResp = func() error {


Should we make use of the existing utility method TerminateWorkflowOnTimeOut for the timeout handling?

@thossain-nv omg yes! amazing.


Applies `WithTx` (and `WithTxResult`!) from NVIDIA#462 to the `Create`/`Update`/`Delete` NSG handlers.

Implements our "`timeoutResp` pattern" (which is something we had introduced in NVIDIA#472, and then @coderabbitai said we should be consistent by doing it everywhere).

TLDR is the existing code calls `common.TerminateWorkflowOnTimeOut` on timeout, but we want to defer that until after the transaction is unwound + DB connection back (because we don't want it to block waiting on the network). The adjustment (which we've done before, but figured I'd call it out more explicitly here) is effectively:

```
var timeoutResp func() error
err = cdb.WithTx(ctx, ..., func(tx *cdb.Tx) error {
    ...
    if /* workflow timeout detected */ {
        // capture the terminate work, but DON'T do it yet
        timeoutResp = func() error {
            return common.TerminateWorkflowOnTimeOut(...)
        }
        return cutil.NewAPIError(...) // forces rollback
    }
    ...
})
// rollback has now completed, now we do potentially blocking network work
if timeoutResp != nil {
    return timeoutResp()
}
```

Also addressed some @coderabbitai feedback around log messages in advance.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

chet requested a review from a team as a code owner May 2, 2026 05:40

coderabbitai Bot reviewed May 2, 2026


Comment thread api/pkg/api/handler/vpc.go Outdated

chet force-pushed the with-tx-vpc branch from e1bbf86 to eadc66d on May 2, 2026 06:55

chet mentioned this pull request May 4, 2026

refactor: Migrate Network Security Group API handlers to WithTx transaction helper #478

Open

21 tasks

hwadekar-nv reviewed May 4, 2026


thossain-nv approved these changes May 5, 2026


chet force-pushed the with-tx-vpc branch from eadc66d to 4b988a1 on May 6, 2026 17:00

chet merged commit c62e583 into NVIDIA:main May 6, 2026
50 checks passed

chet merged 1 commit into NVIDIA:main from chet:with-tx-vpc
