fix(workloadmanager): prevent deletion of healthy sandbox on store cache update failure #320
Abhinav-kodes wants to merge 2 commits into main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing `/approve` in a comment.
fix(workloadmanager): prevent deletion of healthy sandbox on store cache update failure

Signed-off-by: Abhinav-kodes <183825080+Abhinav-kodes@users.noreply.github.com>
Force-pushed from 0184e26 to 0692ace.
Code Review
This pull request modifies the sandbox creation workflow by introducing manual store cleanup when a store update fails and preventing the automatic rollback of the sandbox resource. Feedback highlights a significant consistency risk where moving the rollback flag results in unmanaged 'zombie' pods that consume resources but are inaccessible via the API. It is also recommended to enhance test coverage to verify that the manual store cleanup is correctly executed.
```go
needRollbackSandbox = false
if err := s.storeClient.UpdateSandbox(ctx, storeCacheInfo); err != nil {
	// Best-effort cleanup of the store placeholder; uses a detached context
	// because the request ctx may already be canceled.
	cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if delErr := s.storeClient.DeleteSandboxBySessionID(cleanupCtx, sandboxEntry.SessionID); delErr != nil {
		klog.Errorf("sandbox %s/%s failed to clean up store placeholder for session %s: %v", sandbox.Namespace, sandbox.Name, sandboxEntry.SessionID, delErr)
	}
	if isContextError(err) {
		return nil, err
	}
	return nil, api.NewInternalError(fmt.Errorf("update store cache failed: %w", err))
}
```
This change introduces a potential resource leak and consistency issue. By setting needRollbackSandbox = false before the UpdateSandbox call, the Kubernetes pod is no longer deleted if the store update fails. However, the function still returns an error to the client (lines 280, 282), meaning the user receives a failure response and does not receive the sandbox entrypoints.
This results in a "zombie" pod that is running and consuming resources in the cluster but is unmanaged (since its store entry is deleted at line 275) and inaccessible to the user via the API. If the store update is critical enough to return an error, the entire operation should likely be rolled back to maintain consistency. If the pod is considered healthy enough to keep, consider returning the success response despite the store error, or implementing a background reconciliation to sync the store later.
```diff
 expectCreateCalls: 1,
 expectUpdateCalls: 1,
-expectDeleteCalls: 1,
+expectDeleteCalls: 0,
```
While this change correctly reflects that the Kubernetes resource is no longer deleted when the store update fails, the test does not verify that the manual store cleanup (DeleteSandboxBySessionID) is actually performed. It is recommended to update the fakeStore (and the test assertions) to track calls to DeleteSandboxBySessionID to ensure that the store placeholder is properly removed in this failure scenario.
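A sketch of what that tracking could look like; the real `fakeStore` in `pkg/workloadmanager/handlers_test.go` is not shown in this diff, so the struct fields, the helper, and the `context`/`testing` imports below are assumptions:

```go
// Assumed shape of the test fake; only the new call counter matters here.
type fakeStore struct {
	updateErr              error
	deleteBySessionIDCalls int
}

func (f *fakeStore) DeleteSandboxBySessionID(ctx context.Context, sessionID string) error {
	f.deleteBySessionIDCalls++
	return nil
}

// In the table test, after createSandbox returns:
func assertStoreCleanup(t *testing.T, fs *fakeStore, want int) {
	t.Helper()
	if fs.deleteBySessionIDCalls != want {
		t.Errorf("DeleteSandboxBySessionID calls = %d, want %d", fs.deleteBySessionIDCalls, want)
	}
}
```

Note that once the cleanup moves into a goroutine (as in the later revision of this PR), the assertion also needs to synchronize first, e.g. poll the counter or have the fake signal on a channel.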
Pull request overview
This PR adjusts Workload Manager sandbox creation rollback behavior so that a successfully started and verified Kubernetes sandbox is not deleted when the secondary store/cache update fails, and it attempts to clean up the store “placeholder” entry on that failure path.
Changes:
- Clears `needRollbackSandbox` before `storeClient.UpdateSandbox` to prevent deleting a healthy K8s sandbox on store update failures.
- On `UpdateSandbox` error, attempts to remove the store placeholder using `DeleteSandboxBySessionID` with a detached timeout context.
- Updates the `TestServerCreateSandbox` table test to expect no K8s delete rollback when the store update fails.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/workloadmanager/handlers.go | Prevents rollback deletion of a verified sandbox on store update failure; adds best-effort store placeholder cleanup with a detached context. |
| pkg/workloadmanager/handlers_test.go | Updates the rollback expectations for the “update store fails” scenario. |
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #320      +/-   ##
==========================================
+ Coverage   47.57%   47.76%   +0.19%
==========================================
  Files          30       30
  Lines        2819     2862      +43
==========================================
+ Hits         1341     1367      +26
- Misses       1338     1345       +7
- Partials      140      150      +10
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
Force-pushed from 7b08ba9 to 8f79e5e.
Force-pushed from 8f79e5e to 1f392ff.
Force-pushed from 1f392ff to 82cb9d0.
```go
needRollbackSandbox = false
if err := s.storeClient.UpdateSandbox(ctx, storeCacheInfo); err != nil {
	// Later revision: run the placeholder cleanup asynchronously so the
	// request path is not blocked; still uses a detached timeout context.
	go func(sessionID, ns, name string) {
		cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer cancel()
		if delErr := s.storeClient.DeleteSandboxBySessionID(cleanupCtx, sessionID); delErr != nil {
			klog.Errorf("sandbox %s/%s failed to clean up store placeholder for session %s: %v", ns, name, sessionID, delErr)
		}
	}(sandboxEntry.SessionID, sandbox.Namespace, sandbox.Name)
	// ... error handling continues as in the earlier hunk ...
}
```
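Why the detached context matters: a self-contained sketch (not from the PR) showing that a context derived from an already-canceled request context is dead on arrival, while one derived from `context.Background()` gets its full timeout:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	reqCtx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the client disconnecting before cleanup runs

	// Derived from the request context: canceled immediately.
	fromReq, c1 := context.WithTimeout(reqCtx, 10*time.Second)
	defer c1()
	fmt.Println("from request ctx:", fromReq.Err()) // context canceled

	// Detached: unaffected by the request's cancellation.
	detached, c2 := context.WithTimeout(context.Background(), 10*time.Second)
	defer c2()
	fmt.Println("detached ctx:", detached.Err()) // <nil>
}
```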
Signed-off-by: Abhinav-kodes <183825080+Abhinav-kodes@users.noreply.github.com>
Force-pushed from 82cb9d0 to 800c2f3.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Currently, in `createSandbox`, if a sandbox successfully starts, receives a Pod IP, and has its entrypoints verified, a subsequent failure in `s.storeClient.UpdateSandbox` will trigger the deferred rollback function. This results in the healthy Kubernetes pod being deleted purely due to a transient cache/store write failure.

Because the K8s cluster is the source of truth and the workload is already running and verified, it should not be destroyed if the secondary store fails to update.
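For context, a sketch of the deferred-rollback shape being described; the real `createSandbox` is not in this diff, and `deleteSandboxResources` is a hypothetical helper name:

```go
// Sketch of the pre-existing pattern; names other than needRollbackSandbox,
// storeClient.UpdateSandbox, and storeCacheInfo are hypothetical.
needRollbackSandbox := true
defer func() {
	if needRollbackSandbox {
		// Before this PR, a store failure below left the flag set, so this
		// deleted a healthy, verified pod.
		_ = s.deleteSandboxResources(context.Background(), sandbox)
	}
}()

// ... create the pod, wait for a Pod IP, verify entrypoints ...

needRollbackSandbox = false // this PR clears the flag before the store write
if err := s.storeClient.UpdateSandbox(ctx, storeCacheInfo); err != nil {
	// a store failure no longer triggers the deferred pod deletion
}
```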
This PR:

- Adds a `needRollbackSandbox = false` clearance above the `UpdateSandbox` call to prevent destroying verified, running K8s workloads.
- Adds best-effort store cleanup (`DeleteSandboxBySessionID`) using a new, detached context if `UpdateSandbox` fails. This ensures we don't leave a stale "placeholder" entry in the store, while safely leaving the pod alive.

Special notes for your reviewer:
- `TestServerCreateSandbox/update_store_fails_triggers_rollback` originally expected `1` delete call, which effectively enforced the bug. I updated this test to expect `0` delete calls to enforce the correct architectural behavior (a healthy pod should not be deleted on a store update failure).
- The cleanup uses `context.WithTimeout(context.Background(), ...)` because the original request `ctx` may have already been canceled (e.g., client disconnected/timed out), which would cause the cleanup to fail instantly if reused.

Does this PR introduce a user-facing change?: