refactor: Migrate vpc handlers to WithTx#472

Merged
chet merged 1 commit into NVIDIA:main from
chet:with-tx-vpc
May 6, 2026

Conversation

Contributor

@chet chet commented May 2, 2026

Description

This applies the new WithTx pattern from #462 to the vpc handlers (Create/Update/Delete plus the virtualization-type update).

The UpdateVirtualization handler uses the timeoutResp closure pattern for common.TerminateWorkflowOnTimeOut so the helper still runs after the tx unwinds; the other three retain their inline stc.TerminateWorkflow paths.

Keeping these PRs smaller and tightly scoped so they're:

  • In theory a little easier to read.
  • More tightly scoped/less "blast radius" per PR.
  • A little nicer on/for @coderabbitai. 😆

I do wish the diffs were easier to read, but it is what it is!

Signed-off-by: Chet Nichols III chetn@nvidia.com

Type of Change

  • Feature - New feature or functionality (feat:)
  • Fix - Bug fixes (fix:)
  • Chore - Modification or removal of existing functionality (chore:)
  • Refactor - Refactoring of existing functionality (refactor:)
  • Docs - Changes in documentation or OpenAPI schema (docs:)
  • CI - Changes in GitHub workflows. Requires additional scrutiny (ci:)
  • Version - Issuing a new release version (version:)

Services Affected

  • API - API models or endpoints updated
  • Workflow - Workflow service updated
  • DB - DB DAOs or migrations updated
  • Site Manager - Site Manager updated
  • Cert Manager - Cert Manager updated
  • Site Agent - Site Agent updated
  • RLA - RLA service updated
  • Powershelf Manager - Powershelf Manager updated
  • NVSwitch Manager - NVSwitch Manager updated

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@chet chet requested a review from a team as a code owner May 2, 2026 05:40
Contributor

coderabbitai Bot commented May 2, 2026

Caution

Review failed

Failed to post review comments

Summary by CodeRabbit

  • Refactor
    • Improved VPC operation internals for more reliable create, update, virtualization update, and delete flows, ensuring consistent status transitions, synchronous orchestration, and best-effort post-operation updates.
  • Tests
    • Updated test expectation text to reflect revised timeout/error messaging for VPC update workflows.

Walkthrough

Refactors VPC handlers to run DB updates and synchronous Temporal workflow invocations inside cdb.WithTx closures. Handlers initialize needed request/state outside transactions, defer workflow-timeout termination via closures, and perform post-transaction best-effort reconciliation (e.g., Active VNI, Ready status) where applicable.

Changes

VPC Handler Transaction Refactoring

| Layer | File(s) | Summary |
|---|---|---|
| Import / Setup | api/pkg/api/handler/vpc.go | Removed manual transaction Begin/Commit/Rollback usage; transaction control delegated to cdb.WithTx. Added timeoutResp termination closures and workflow result placeholders declared outside transactional closures. |
| State Init / Request Build | api/pkg/api/handler/vpc.go | Controller/workflow request structs (e.g., controllerVpc, uv), status-detail DAO (sdDAO), and status-detail slices (ssds) are prepared outside the transaction to enable deferred actions and post-transaction reconciliation. |
| Create VPC: DB + Workflow | api/pkg/api/handler/vpc.go (CreateVPC handler) | Inside cdb.WithTx: create VPC record, update controller ID, create provisioning status-detail, obtain Site Temporal client, and execute synchronous CreateVPCV2 workflow. After the transaction, best-effort updates set ActiveVni and add Ready status-detail if workflow succeeded; timeoutResp handles workflow timeouts. |
| Update VPC: DB + Workflow | api/pkg/api/handler/vpc.go (UpdateVPC handler) | Inside cdb.WithTx: apply VPC field updates, optionally clear NSG/NVLink propagation state, fetch status details, obtain Site Temporal client, and execute synchronous UpdateVPC workflow. Post-transaction maps DB model to API model and defers timeout termination via timeoutResp. |
| Update VPC Virtualization: DB + Workflow | api/pkg/api/handler/vpc.go (UpdateVPCVirtualization handler) | Inside cdb.WithTx: update VPC virtualization type, fetch status history, obtain Site Temporal client, and execute synchronous UpdateVPCVirtualization workflow. Timeout termination is deferred through timeoutResp; workflow results are mapped to API errors afterward. |
| Delete VPC: DB + Workflow | api/pkg/api/handler/vpc.go (DeleteVPC handler) | Inside cdb.WithTx: set VPC status to Deleting, create a best-effort deletion status-detail, obtain Site Temporal client, and execute synchronous DeleteVPCV2 workflow. NotFound from workflow is treated as skippable. Handler returns 202 Accepted after transaction completes; timeout handling deferred. |
| Tests | api/pkg/api/handler/vpc_test.go | Updated expected error message for the VPC update workflow timeout test to match revised timeout text. |
| Post-Transaction Handling / Errors | api/pkg/api/handler/vpc.go | After WithTx, handlers invoke deferred timeoutResp() when set, perform best-effort state reconciliation, and translate transaction/workflow errors into API errors (via existing utilities like common.HandleTxError). |

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'refactor: Migrate vpc handlers to WithTx' directly and clearly summarizes the main change: migrating VPC handlers to use the WithTx transaction pattern. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining which handlers are being refactored to use WithTx, the pattern variations employed, and referencing the originating issue #462. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

github-actions Bot commented May 2, 2026

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🕐 Last updated: 2026-05-02 05:41:34 UTC | Commit: e1bbf86

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/pkg/api/handler/vpc.go`:
- Around line 449-466: The timeout handling currently calls
stc.TerminateWorkflow from inside the cdb.WithTx closure (checking
errors.As(wferr, &timeoutErr) || wferr == context.DeadlineExceeded ||
wfCtx.Err() != nil), holding the DB transaction during the RPC; instead, set a
local flag (e.g., timeoutResp or a boolean like needTerminate) and capture the
workflow ID (wid) and wferr inside the closure, then return from WithTx; after
WithTx completes, if needTerminate is true create a new context with
context.WithTimeout using cutil.WorkflowContextNewAfterTimeout and call
stc.TerminateWorkflow (handling serr/logging and returning the appropriate
cutil.NewAPIError), mirroring the pattern used by UpdateVirtualization to avoid
making remote calls while the transaction is open; reference
stc.TerminateWorkflow, timeoutErr, wferr, wfCtx, wid,
cutil.WorkflowContextNewAfterTimeout and logger to locate and implement the
change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f7a330e7-4070-4834-be73-35fa422f9384

📥 Commits

Reviewing files that changed from the base of the PR and between bce7503 and e1bbf86.

📒 Files selected for processing (1)
  • api/pkg/api/handler/vpc.go

Comment thread api/pkg/api/handler/vpc.go Outdated

github-actions Bot commented May 2, 2026

🔍 Container Scan Summary

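| Service | Total | Critical | High | Medium | Low | Other |
|---|---|---|---|---|---|---|
| nico-nsm | 64 | 2 | 20 | 33 | 9 | 0 |
| nico-psm | 56 | 4 | 29 | 13 | 2 | 8 |
| nico-rest-api | 57 | 4 | 30 | 13 | 2 | 8 |
| nico-rest-cert-manager | 54 | 4 | 28 | 13 | 1 | 8 |
| nico-rest-db | 55 | 4 | 28 | 13 | 2 | 8 |
| nico-rest-site-agent | 54 | 4 | 28 | 13 | 1 | 8 |
| nico-rest-site-manager | 54 | 4 | 28 | 13 | 1 | 8 |
| nico-rest-workflow | 56 | 4 | 29 | 13 | 2 | 8 |
| nico-rla | 55 | 4 | 28 | 13 | 2 | 8 |
| TOTAL | 505 | 34 | 248 | 137 | 22 | 64 |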

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

var timeoutErr *tp.TimeoutError
if errors.As(wferr, &timeoutErr) || wferr == context.DeadlineExceeded || wfCtx.Err() != nil {
logger.Error().Err(wferr).Msg("failed to create VPC, timeout occurred executing workflow on Site.")
timeoutResp = func() error {
Contributor

I’m trying to understand how this cancel workflow will execute based on this pattern.

Contributor

WithTx returns an APIError, which gets processed after WithTx executes. Otherwise, HandleTxError manages returning the error.

Contributor Author

@hwadekar-nv Yeahhh good question! The timeoutResp flow is:

  1. Inside the closure, when we detect a timeout, we don't call TerminateWorkflow directly. Instead, we stash a closure in the outer-scope timeoutResp that knows how to do the termination work.
  2. We then return a *cutil.APIError from the WithTx closure, which causes WithTx to roll back the transaction.
  3. After WithTx returns (rollback complete, locks released, etc.), the outer code checks if timeoutResp is set; that's when the actual TerminateWorkflow RPC fires (calling TerminateWorkflowOnTimeOut, which builds a fresh context with WorkflowContextNewAfterTimeout, hits Site to terminate the orphaned workflow, and returns an error to the caller).

And if it's not obvious at first glance, the reason we don't inline TerminateWorkflow while still inside the closure is that the DB transaction would stay open during that second RPC, which would continue to hold locks + a connection while we wait on the network/RPC.

We ran into this a bunch in the Core codebase, and have spent almost a year trying to clean it up and unwind it (we even have a custom compiler extension now that looks for blocking calls during transactions).

Lemme know if that all makes sense!

Contributor

@thossain-nv thossain-nv left a comment

Thank you for the great refactor, @chet! Added a suggestion for timeout handling.

var timeoutErr *tp.TimeoutError
if errors.As(wferr, &timeoutErr) || wferr == context.DeadlineExceeded || wfCtx.Err() != nil {
logger.Error().Err(wferr).Msg("failed to create VPC, timeout occurred executing workflow on Site.")
timeoutResp = func() error {
Contributor

Should we make use of the existing utility method TerminateWorkflowOnTimeOut for the timeout handling?

Contributor Author

@thossain-nv omg yes! amazing.

Contributor Author

done!

This applies the new `WithTx` pattern from NVIDIA#462 to the `vpc` handlers (Create/Update/Delete plus the virtualization-type update). The `UpdateVirtualization` handler uses the `timeoutResp` closure pattern for `common.TerminateWorkflowOnTimeOut` so the helper still runs after the tx unwinds; the other three retain their inline `stc.TerminateWorkflow` paths.

Keeping these PRs smaller and tightly scoped so they're:
- In theory a little easier to read.
- More tightly scoped/less "blast radius" per PR.
- A little nicer on/for @coderabbitai. 😆

I do wish the diffs were easier to read, but it is what it is!

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
@chet chet merged commit c62e583 into NVIDIA:main May 6, 2026
50 checks passed
chet added a commit to chet/bare-metal-manager-rest that referenced this pull request May 6, 2026
Applies `WithTx` (and `WithTxResult`!) from NVIDIA#462 to the `Create`/`Update`/`Delete` NSG handlers.

Implements our "`timeoutResp` pattern" (which is something we had introduced in NVIDIA#472, and then @coderabbitai said we should be consistent by doing it everywhere). TLDR is the existing code calls `common.TerminateWorkflowOnTimeOut` on timeout, but we want to defer that until after the transaction is unwound + the DB connection is released (because we don't want it to block waiting on the network).

The adjustment (which we've done before, but figured I'd call it out more explicitly here) is effectively:
```
    var timeoutResp func() error

    err = cdb.WithTx(ctx, ..., func(tx *cdb.Tx) error {
      ...
      if /* workflow timeout detected */ {
        // capture the terminate work, but DON'T do it yet
        timeoutResp = func() error {
          return common.TerminateWorkflowOnTimeOut(...)
        }
        return cutil.NewAPIError(...)   // forces rollback
      }
      ...
    })

    // rollback has now completed, now we do potentially blocking network work
    if timeoutResp != nil {
      return timeoutResp()
    }
```

Also addressed some @coderabbitai feedback around log messages in advance.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
3 participants