
fix(workloadmanager): prevent deletion of healthy sandbox on store ca…#320

Open
Abhinav-kodes wants to merge 2 commits into volcano-sh:main from Abhinav-kodes:fix-sandbox-rollback

Conversation

@Abhinav-kodes
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:
Currently, in createSandbox, if a sandbox successfully starts, receives a Pod IP, and has its entrypoints verified, a subsequent failure in s.storeClient.UpdateSandbox will trigger the deferred rollback function. This results in the healthy Kubernetes pod being deleted purely due to a transient cache/store write failure.

Because the K8s cluster is the source of truth and the workload is already running and verified, it should not be destroyed if the secondary store fails to update.

This PR:

  1. Moves the needRollbackSandbox = false assignment above the UpdateSandbox call, so a store write failure no longer triggers deletion of a verified, running K8s workload.
  2. Adds a targeted cleanup step (DeleteSandboxBySessionID) using a new, detached context if UpdateSandbox fails. This ensures we don't leave a stale "placeholder" entry in the store, while safely leaving the pod alive.
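
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in handlers.go: fakeStore, fakeCluster, runDemo, and the simplified createSandbox signature are all hypothetical stand-ins, and only the ordering of the flag assignment and the detached cleanup context is taken from the description.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fakeStore and fakeCluster are hypothetical stand-ins for the store client
// and the Kubernetes API; they are not the project's real types.
type fakeStore struct {
	updateErr    error
	placeholders map[string]bool
}

func (s *fakeStore) UpdateSandbox(ctx context.Context, sessionID string) error {
	return s.updateErr
}

func (s *fakeStore) DeleteSandboxBySessionID(ctx context.Context, sessionID string) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	delete(s.placeholders, sessionID)
	return nil
}

type fakeCluster struct{ deleted int }

func (c *fakeCluster) DeleteSandbox() { c.deleted++ }

// createSandbox is a simplified sketch of the ordering described above.
func createSandbox(ctx context.Context, store *fakeStore, cluster *fakeCluster, sessionID string) error {
	needRollbackSandbox := true
	defer func() {
		if needRollbackSandbox {
			cluster.DeleteSandbox()
		}
	}()

	// ... pod started, IP assigned, entrypoints verified ...
	// Clear the flag before the store write so a store error cannot
	// trigger deletion of the verified workload.
	needRollbackSandbox = false

	if err := store.UpdateSandbox(ctx, sessionID); err != nil {
		// Detached context: the request ctx may already be canceled.
		cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer cancel()
		if delErr := store.DeleteSandboxBySessionID(cleanupCtx, sessionID); delErr != nil {
			fmt.Printf("failed to clean up store placeholder for session %s: %v\n", sessionID, delErr)
		}
		return fmt.Errorf("update store cache failed: %w", err)
	}
	return nil
}

// runDemo simulates a failing store write and reports what was cleaned up.
func runDemo() (podDeletes, placeholdersLeft int, err error) {
	store := &fakeStore{
		updateErr:    errors.New("store unavailable"),
		placeholders: map[string]bool{"s1": true},
	}
	cluster := &fakeCluster{}
	err = createSandbox(context.Background(), store, cluster, "s1")
	return cluster.deleted, len(store.placeholders), err
}

func main() {
	podDeletes, placeholdersLeft, err := runDemo()
	fmt.Println("error:", err)
	fmt.Println("pod deletes:", podDeletes, "placeholders left:", placeholdersLeft)
}
```

In this sketch the caller still receives an error while the pod survives, which is the trade-off the reviewers discuss later in the thread.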

Special notes for your reviewer:

  • Testing Change: The existing test TestServerCreateSandbox/update_store_fails_triggers_rollback originally expected 1 delete call, which effectively enforced the bug. I updated this test to expect 0 delete calls to enforce the correct architectural behavior (a healthy pod should not be deleted on a store update failure).
  • Context Handling: The store placeholder cleanup explicitly uses a new context.WithTimeout(context.Background(), ...) because the original request ctx may have already been canceled (e.g., client disconnected/timed out), which would cause the cleanup to fail instantly if reused.
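
The context-handling point can be illustrated with a small standalone program. It assumes a store call that checks its context before doing work, as most Go clients do; cleanup and demo are hypothetical names, not functions from this repository.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// cleanup is a hypothetical store call that honors context cancellation.
func cleanup(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	return nil
}

// demo returns the cleanup result when the request context is reused and
// when a fresh, detached context is used instead.
func demo() (reused, detached error) {
	reqCtx, cancelReq := context.WithCancel(context.Background())
	cancelReq() // simulate a client that disconnected or timed out

	reused = cleanup(reqCtx)

	cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	detached = cleanup(cleanupCtx)
	return reused, detached
}

func main() {
	reused, detached := demo()
	fmt.Println("reused request ctx:", reused)
	fmt.Println("detached ctx:", detached)
}
```

On Go 1.21 or later, context.WithoutCancel(ctx) is a possible alternative that drops cancellation while keeping request-scoped values; whether that matters here depends on what the store client reads from the context.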

Does this PR introduce a user-facing change?:

Fixed an issue where a successfully created sandbox could be erroneously deleted due to a transient internal database/cache error.

Copilot AI review requested due to automatic review settings May 11, 2026 14:33
@volcano-sh-bot added the kind/bug label on May 11, 2026
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…che update failure

Signed-off-by: Abhinav-kodes <183825080+Abhinav-kodes@users.noreply.github.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request modifies the sandbox creation workflow by introducing manual store cleanup when a store update fails and preventing the automatic rollback of the sandbox resource. Feedback highlights a significant consistency risk where moving the rollback flag results in unmanaged 'zombie' pods that consume resources but are inaccessible via the API. It is also recommended to enhance test coverage to verify that the manual store cleanup is correctly executed.

Comment on lines +270 to 283
needRollbackSandbox = false
if err := s.storeClient.UpdateSandbox(ctx, storeCacheInfo); err != nil {
    cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    if delErr := s.storeClient.DeleteSandboxBySessionID(cleanupCtx, sandboxEntry.SessionID); delErr != nil {
        klog.Errorf("sandbox %s/%s failed to clean up store placeholder for session %s: %v", sandbox.Namespace, sandbox.Name, sandboxEntry.SessionID, delErr)
    }

    if isContextError(err) {
        return nil, err
    }
    return nil, api.NewInternalError(fmt.Errorf("update store cache failed: %w", err))
}
Contributor


high

This change introduces a potential resource leak and consistency issue. By setting needRollbackSandbox = false before the UpdateSandbox call, the Kubernetes pod is no longer deleted if the store update fails. However, the function still returns an error to the client (lines 280, 282), meaning the user receives a failure response and does not receive the sandbox entrypoints.

This results in a "zombie" pod that is running and consuming resources in the cluster but is unmanaged (since its store entry is deleted at line 275) and inaccessible to the user via the API. If the store update is critical enough to return an error, the entire operation should likely be rolled back to maintain consistency. If the pod is considered healthy enough to keep, consider returning the success response despite the store error, or implementing a background reconciliation to sync the store later.
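
One way to picture the "background reconciliation" option is a retry list that remembers verified sandboxes whose store write failed and replays the write later. The sketch below is an assumption-laden illustration only: sandboxInfo, flakyStore, reconciler, and demo are hypothetical, the store signature is simplified, and the code is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
)

// sandboxInfo is a hypothetical, trimmed-down store record.
type sandboxInfo struct{ SessionID, PodIP string }

// flakyStore fails the first `failures` writes, then succeeds.
type flakyStore struct {
	failures int
	entries  map[string]sandboxInfo
}

func (s *flakyStore) UpdateSandbox(info sandboxInfo) error {
	if s.failures > 0 {
		s.failures--
		return errors.New("transient store error")
	}
	s.entries[info.SessionID] = info
	return nil
}

// reconciler remembers sandboxes whose store write failed and retries later.
type reconciler struct {
	store   *flakyStore
	pending []sandboxInfo
}

// record attempts the store write and queues the sandbox on failure
// instead of deleting the running workload.
func (r *reconciler) record(info sandboxInfo) {
	if err := r.store.UpdateSandbox(info); err != nil {
		r.pending = append(r.pending, info)
	}
}

// syncOnce retries every pending write and keeps the ones that still fail.
func (r *reconciler) syncOnce() {
	var still []sandboxInfo
	for _, info := range r.pending {
		if err := r.store.UpdateSandbox(info); err != nil {
			still = append(still, info)
		}
	}
	r.pending = still
}

// demo runs one failed write followed by one reconciliation pass.
func demo() (pendingAfterCreate, pendingAfterSync, stored int) {
	r := &reconciler{store: &flakyStore{failures: 1, entries: map[string]sandboxInfo{}}}
	r.record(sandboxInfo{SessionID: "s1", PodIP: "10.0.0.7"})
	pendingAfterCreate = len(r.pending)
	r.syncOnce()
	return pendingAfterCreate, len(r.pending), len(r.store.entries)
}

func main() {
	fmt.Println(demo())
}
```

A real reconciler would likely need locking, a bounded retry policy, and a way to discover pods that exist in the cluster but not in the store; none of that is shown here.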

Comment thread pkg/workloadmanager/handlers_test.go Outdated
  expectCreateCalls: 1,
  expectUpdateCalls: 1,
- expectDeleteCalls: 1,
+ expectDeleteCalls: 0,
Contributor


medium

While this change correctly reflects that the Kubernetes resource is no longer deleted when the store update fails, the test does not verify that the manual store cleanup (DeleteSandboxBySessionID) is actually performed. It is recommended to update the fakeStore (and the test assertions) to track calls to DeleteSandboxBySessionID to ensure that the store placeholder is properly removed in this failure scenario.
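
A test double along the lines suggested here might look like the sketch below. The project's actual fakeStore is not shown in this thread, so countingStore, storeUpdateWithCleanup, and runDemo are hypothetical names used only to show how calls to DeleteSandboxBySessionID could be recorded and asserted on.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// countingStore is a hypothetical test double that records which sessions
// were passed to DeleteSandboxBySessionID.
type countingStore struct {
	mu             sync.Mutex
	updateErr      error
	sessionDeletes []string
}

func (s *countingStore) UpdateSandbox(ctx context.Context, sessionID string) error {
	return s.updateErr
}

func (s *countingStore) DeleteSandboxBySessionID(ctx context.Context, sessionID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sessionDeletes = append(s.sessionDeletes, sessionID)
	return nil
}

// deletedSessions returns a copy so callers cannot race with later appends.
func (s *countingStore) deletedSessions() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]string(nil), s.sessionDeletes...)
}

// storeUpdateWithCleanup stands in for the failure path under test.
func storeUpdateWithCleanup(ctx context.Context, store *countingStore, sessionID string) error {
	if err := store.UpdateSandbox(ctx, sessionID); err != nil {
		_ = store.DeleteSandboxBySessionID(context.Background(), sessionID)
		return fmt.Errorf("update store cache failed: %w", err)
	}
	return nil
}

// runDemo forces the update to fail and returns the recorded cleanup calls.
func runDemo() []string {
	store := &countingStore{updateErr: errors.New("store unavailable")}
	_ = storeUpdateWithCleanup(context.Background(), store, "s1")
	return store.deletedSessions()
}

func main() {
	fmt.Println(runDemo())
}
```

In a table test, the recorded slice could be compared against an expected list of session IDs next to the existing expectDeleteCalls field.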

Contributor

Copilot AI left a comment


Pull request overview

This PR adjusts Workload Manager sandbox creation rollback behavior so that a successfully started and verified Kubernetes sandbox is not deleted when the secondary store/cache update fails, and it attempts to clean up the store “placeholder” entry on that failure path.

Changes:

  • Clears needRollbackSandbox before storeClient.UpdateSandbox to prevent deleting a healthy K8s sandbox on store update failures.
  • On UpdateSandbox error, attempts to remove the store placeholder using DeleteSandboxBySessionID with a detached timeout context.
  • Updates the TestServerCreateSandbox table test to expect no K8s delete rollback when the store update fails.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pkg/workloadmanager/handlers.go | Prevents rollback deletion of a verified sandbox on store update failure; adds best-effort store placeholder cleanup with a detached context. |
| pkg/workloadmanager/handlers_test.go | Updates the rollback expectations for the “update store fails” scenario. |

Comment thread pkg/workloadmanager/handlers.go Outdated
Comment on lines +272 to +278

cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
if delErr := s.storeClient.DeleteSandboxBySessionID(cleanupCtx, sandboxEntry.SessionID); delErr != nil {
    klog.Errorf("sandbox %s/%s failed to clean up store placeholder for session %s: %v", sandbox.Namespace, sandbox.Name, sandboxEntry.SessionID, delErr)
}

Comment thread pkg/workloadmanager/handlers_test.go Outdated
@codecov-commenter

codecov-commenter commented May 11, 2026


Codecov Report

❌ Patch coverage is 54.54545% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 47.76%. Comparing base (524e55e) to head (800c2f3).
⚠️ Report is 41 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/workloadmanager/handlers.go | 54.54% | 8 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #320      +/-   ##
==========================================
+ Coverage   47.57%   47.76%   +0.19%     
==========================================
  Files          30       30              
  Lines        2819     2862      +43     
==========================================
+ Hits         1341     1367      +26     
- Misses       1338     1345       +7     
- Partials      140      150      +10     

| Flag | Coverage Δ |
| --- | --- |
| unittests | 47.76% <54.54%> (+0.19%) ⬆️ |


Copilot AI review requested due to automatic review settings May 11, 2026 14:49
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread pkg/workloadmanager/handlers_test.go Outdated
Comment on lines 171 to 178
  expectUpdateCalls: 1,
- expectDeleteCalls: 1,
+ expectDeleteCalls: 0,
  },
Comment thread pkg/workloadmanager/handlers.go Outdated
Comment on lines +257 to +263
if err := s.storeClient.UpdateSandbox(ctx, storeCacheInfo); err != nil {

    cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    if delErr := s.storeClient.DeleteSandboxBySessionID(cleanupCtx, sandboxEntry.SessionID); delErr != nil {
        klog.Errorf("sandbox %s/%s failed to clean up store placeholder for session %s: %v", sandbox.Namespace, sandbox.Name, sandboxEntry.SessionID, delErr)
    }
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread pkg/workloadmanager/handlers.go
Comment thread pkg/workloadmanager/handlers_test.go
@Abhinav-kodes force-pushed the fix-sandbox-rollback branch from 8f79e5e to 1f392ff on May 11, 2026 21:47
Copilot AI review requested due to automatic review settings May 11, 2026 22:21
@Abhinav-kodes force-pushed the fix-sandbox-rollback branch from 1f392ff to 82cb9d0 on May 11, 2026 22:21
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment on lines +258 to +262

go func(sessionID, ns, name string) {
    cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    if delErr := s.storeClient.DeleteSandboxBySessionID(cleanupCtx, sessionID); delErr != nil {
needRollbackSandbox = false
if err := s.storeClient.UpdateSandbox(ctx, storeCacheInfo); err != nil {

Comment on lines +258 to +266

go func(sessionID, ns, name string) {
    cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    if delErr := s.storeClient.DeleteSandboxBySessionID(cleanupCtx, sessionID); delErr != nil {
        klog.Errorf("sandbox %s/%s failed to clean up store placeholder for session %s: %v", ns, name, sessionID, delErr)
    }
}(sandboxEntry.SessionID, sandbox.Namespace, sandbox.Name)
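
Moving the cleanup into a goroutine means the handler returns before the delete finishes, which can make tests and shutdown harder to reason about. The sketch below shows one possible way to keep the goroutine observable with a sync.WaitGroup. The server and placeholderStore types, the cleanupWG field, and runDemo are hypothetical and are not taken from handlers.go.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// placeholderStore is a hypothetical store that records deleted sessions.
type placeholderStore struct {
	mu      sync.Mutex
	deleted []string
}

func (p *placeholderStore) DeleteSandboxBySessionID(ctx context.Context, sessionID string) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	p.deleted = append(p.deleted, sessionID)
	return nil
}

// server is a hypothetical handler owner; cleanupWG tracks in-flight cleanups.
type server struct {
	store     *placeholderStore
	cleanupWG sync.WaitGroup
}

// cleanupPlaceholderAsync removes the store placeholder in the background.
// Values are passed as arguments so the goroutine does not capture variables
// that the caller may change after returning.
func (s *server) cleanupPlaceholderAsync(sessionID, ns, name string) {
	s.cleanupWG.Add(1)
	go func(sessionID, ns, name string) {
		defer s.cleanupWG.Done()
		cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer cancel()
		if delErr := s.store.DeleteSandboxBySessionID(cleanupCtx, sessionID); delErr != nil {
			fmt.Printf("sandbox %s/%s failed to clean up store placeholder for session %s: %v\n", ns, name, sessionID, delErr)
		}
	}(sessionID, ns, name)
}

// runDemo starts one background cleanup, waits for it, and counts deletions.
func runDemo() int {
	s := &server{store: &placeholderStore{}}
	s.cleanupPlaceholderAsync("s1", "default", "sandbox-a")
	s.cleanupWG.Wait() // a test or a graceful shutdown can wait here
	s.store.mu.Lock()
	defer s.store.mu.Unlock()
	return len(s.store.deleted)
}

func main() {
	fmt.Println("placeholders deleted:", runDemo())
}
```

Waiting on the group in tests avoids sleeping or polling to find out whether the background delete ran.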

Signed-off-by: Abhinav-kodes <183825080+Abhinav-kodes@users.noreply.github.com>
@Abhinav-kodes force-pushed the fix-sandbox-rollback branch from 82cb9d0 to 800c2f3 on May 11, 2026 22:27

Labels

kind/bug Something isn't working size/L


4 participants