fix(workloadmanager): prevent deletion of healthy sandbox on store cache update failure #320
Abhinav-kodes wants to merge 2 commits into main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing `/approve` in a comment.
fix(workloadmanager): prevent deletion of healthy sandbox on store cache update failure

Signed-off-by: Abhinav-kodes <183825080+Abhinav-kodes@users.noreply.github.com>
Force-pushed from 0184e26 to 0692ace.
Code Review
This pull request modifies the sandbox creation workflow by introducing manual store cleanup when a store update fails and preventing the automatic rollback of the sandbox resource. Feedback highlights a significant consistency risk where moving the rollback flag results in unmanaged 'zombie' pods that consume resources but are inaccessible via the API. It is also recommended to enhance test coverage to verify that the manual store cleanup is correctly executed.
```go
needRollbackSandbox = false
if err := s.storeClient.UpdateSandbox(ctx, storeCacheInfo); err != nil {
	// Best-effort cleanup of the store placeholder; uses a detached context
	// because the request ctx may already be canceled.
	cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if delErr := s.storeClient.DeleteSandboxBySessionID(cleanupCtx, sandboxEntry.SessionID); delErr != nil {
		klog.Errorf("sandbox %s/%s failed to clean up store placeholder for session %s: %v", sandbox.Namespace, sandbox.Name, sandboxEntry.SessionID, delErr)
	}
	if isContextError(err) {
		return nil, err
	}
	return nil, api.NewInternalError(fmt.Errorf("update store cache failed: %w", err))
}
```
This change introduces a potential resource leak and consistency issue. By setting needRollbackSandbox = false before the UpdateSandbox call, the Kubernetes pod is no longer deleted if the store update fails. However, the function still returns an error to the client (lines 280, 282), meaning the user receives a failure response and does not receive the sandbox entrypoints.
This results in a "zombie" pod that is running and consuming resources in the cluster but is unmanaged (since its store entry is deleted at line 275) and inaccessible to the user via the API. If the store update is critical enough to return an error, the entire operation should likely be rolled back to maintain consistency. If the pod is considered healthy enough to keep, consider returning the success response despite the store error, or implementing a background reconciliation to sync the store later.
```diff
 expectCreateCalls: 1,
 expectUpdateCalls: 1,
-expectDeleteCalls: 1,
+expectDeleteCalls: 0,
```
While this change correctly reflects that the Kubernetes resource is no longer deleted when the store update fails, the test does not verify that the manual store cleanup (DeleteSandboxBySessionID) is actually performed. It is recommended to update the fakeStore (and the test assertions) to track calls to DeleteSandboxBySessionID to ensure that the store placeholder is properly removed in this failure scenario.
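A sketch of what that tracking could look like; the real `fakeStore` in `pkg/workloadmanager/handlers_test.go` is not shown in this diff, so the struct fields, the helper, and the `context`/`testing` imports below are assumptions:

```go
// Assumed shape of the test fake; only the new call counter matters here.
type fakeStore struct {
	updateErr              error
	deleteBySessionIDCalls int
}

func (f *fakeStore) DeleteSandboxBySessionID(ctx context.Context, sessionID string) error {
	f.deleteBySessionIDCalls++
	return nil
}

// In the table test, after createSandbox returns:
func assertStoreCleanup(t *testing.T, fs *fakeStore, want int) {
	t.Helper()
	if fs.deleteBySessionIDCalls != want {
		t.Errorf("DeleteSandboxBySessionID calls = %d, want %d", fs.deleteBySessionIDCalls, want)
	}
}
```

Note that once the cleanup moves into a goroutine (as in the later revision of this PR), the assertion also needs to synchronize first, e.g. poll the counter or have the fake signal on a channel.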
Pull request overview
This PR adjusts Workload Manager sandbox creation rollback behavior so that a successfully started and verified Kubernetes sandbox is not deleted when the secondary store/cache update fails, and it attempts to clean up the store “placeholder” entry on that failure path.
Changes:
- Clears `needRollbackSandbox` before `storeClient.UpdateSandbox` to prevent deleting a healthy K8s sandbox on store update failures.
- On `UpdateSandbox` error, attempts to remove the store placeholder using `DeleteSandboxBySessionID` with a detached timeout context.
- Updates the `TestServerCreateSandbox` table test to expect no K8s delete rollback when the store update fails.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/workloadmanager/handlers.go | Prevents rollback deletion of a verified sandbox on store update failure; adds best-effort store placeholder cleanup with a detached context. |
| pkg/workloadmanager/handlers_test.go | Updates the rollback expectations for the “update store fails” scenario. |
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #320      +/-   ##
==========================================
+ Coverage   47.57%   47.76%   +0.19%
==========================================
  Files          30       30
  Lines        2819     2862      +43
==========================================
+ Hits         1341     1367      +26
- Misses       1338     1345       +7
- Partials      140      150      +10
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
Force-pushed from 7b08ba9 to 8f79e5e.
Force-pushed from 8f79e5e to 1f392ff.
Force-pushed from 1f392ff to 82cb9d0.
```go
needRollbackSandbox = false
if err := s.storeClient.UpdateSandbox(ctx, storeCacheInfo); err != nil {
	// Later revision: run the placeholder cleanup asynchronously so the
	// request path is not blocked; still uses a detached timeout context.
	go func(sessionID, ns, name string) {
		cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer cancel()
		if delErr := s.storeClient.DeleteSandboxBySessionID(cleanupCtx, sessionID); delErr != nil {
			klog.Errorf("sandbox %s/%s failed to clean up store placeholder for session %s: %v", ns, name, sessionID, delErr)
		}
	}(sandboxEntry.SessionID, sandbox.Namespace, sandbox.Name)
	// ... error handling continues as in the earlier hunk ...
}
```
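Why the detached context matters: a self-contained sketch (not from the PR) showing that a context derived from an already-canceled request context is dead on arrival, while one derived from `context.Background()` gets its full timeout:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	reqCtx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the client disconnecting before cleanup runs

	// Derived from the request context: canceled immediately.
	fromReq, c1 := context.WithTimeout(reqCtx, 10*time.Second)
	defer c1()
	fmt.Println("from request ctx:", fromReq.Err()) // context canceled

	// Detached: unaffected by the request's cancellation.
	detached, c2 := context.WithTimeout(context.Background(), 10*time.Second)
	defer c2()
	fmt.Println("detached ctx:", detached.Err()) // <nil>
}
```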
Signed-off-by: Abhinav-kodes <183825080+Abhinav-kodes@users.noreply.github.com>
Force-pushed from 82cb9d0 to 800c2f3.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Currently, in `createSandbox`, if a sandbox successfully starts, receives a Pod IP, and has its entrypoints verified, a subsequent failure in `s.storeClient.UpdateSandbox` will trigger the deferred rollback function. This results in the healthy Kubernetes pod being deleted purely due to a transient cache/store write failure.

Because the K8s cluster is the source of truth and the workload is already running and verified, it should not be destroyed if the secondary store fails to update.
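For context, a sketch of the deferred-rollback shape being described; the real `createSandbox` is not in this diff, and `deleteSandboxResources` is a hypothetical helper name:

```go
// Sketch of the pre-existing pattern; names other than needRollbackSandbox,
// storeClient.UpdateSandbox, and storeCacheInfo are hypothetical.
needRollbackSandbox := true
defer func() {
	if needRollbackSandbox {
		// Before this PR, a store failure below left the flag set, so this
		// deleted a healthy, verified pod.
		_ = s.deleteSandboxResources(context.Background(), sandbox)
	}
}()

// ... create the pod, wait for a Pod IP, verify entrypoints ...

needRollbackSandbox = false // this PR clears the flag before the store write
if err := s.storeClient.UpdateSandbox(ctx, storeCacheInfo); err != nil {
	// a store failure no longer triggers the deferred pod deletion
}
```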
This PR:

- Adds a `needRollbackSandbox = false` clearance above the `UpdateSandbox` call to prevent destroying verified, running K8s workloads.
- Adds best-effort store cleanup (`DeleteSandboxBySessionID`) using a new, detached context if `UpdateSandbox` fails. This ensures we don't leave a stale "placeholder" entry in the store, while safely leaving the pod alive.

Special notes for your reviewer:
- `TestServerCreateSandbox/update_store_fails_triggers_rollback` originally expected `1` delete call, which effectively enforced the bug. I updated this test to expect `0` delete calls to enforce the correct architectural behavior (a healthy pod should not be deleted on a store update failure).
- The cleanup uses `context.WithTimeout(context.Background(), ...)` because the original request `ctx` may have already been canceled (e.g., client disconnected/timed out), which would cause the cleanup to fail instantly if reused.

Does this PR introduce a user-facing change?: