Conversation
@nolancon nolancon commented Nov 24, 2025

Description of your changes

Orphaned S3 buckets (i.e. an S3 bucket still exists on one or more Ceph backends after the CR has been fully removed) have been discovered on two or three occasions over the past year.

Upon investigation, it turns out that Provider Ceph's delete flow, as it iterates over the backends from which to delete a bucket, skips any backend on which the bucket is "unavailable" (as per the CR status). However, "unavailable" does not mean "does not exist"; it only means the backend is unreachable.
A scenario can therefore occur where a user creates a bucket while an RGW is unreachable, so Provider Ceph marks the bucket on that backend as "unavailable". The bucket is created on the other backends and is still "ready" from the user's perspective. If the user then deletes the bucket before Provider Ceph has had a chance to fully sync all backends, the "unavailable" bucket is skipped during the deletion loop and left orphaned on that backend.

I don't know or remember why this check/skip was implemented. Either (1) it was simply a mistake, or (2) it was an attempt to avoid hanging CRs. If the latter, I think a hanging CR is a much lesser evil than an orphaned S3 bucket.

There is also a check that skips backends which no longer exist in the Status. This one is valid: deleting an S3 bucket should also remove it from the Status.
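To illustrate, here is a minimal Go sketch of the corrected loop. The names (`deleteFromBackends`, `deleteBucket`, `backendStatus`) are hypothetical and simplified; the real code in internal/controller/bucket/delete.go differs. The loop still skips backends absent from the Status, but no longer skips "unavailable" ones:

```go
package main

import "fmt"

type backendStatus string

const (
	statusReady       backendStatus = "Ready"
	statusUnavailable backendStatus = "Unavailable"
)

// deleteFromBackends attempts deletion on every backend present in the CR
// status. Backends missing from the status are skipped (the bucket is already
// gone there). "Unavailable" backends are NOT skipped, so a temporarily
// unreachable RGW can no longer leave an orphaned bucket behind.
func deleteFromBackends(allBackends []string, status map[string]backendStatus,
	deleteBucket func(backend string) error) []string {
	var deleted []string
	for _, b := range allBackends {
		if _, inStatus := status[b]; !inStatus {
			continue // bucket already removed from this backend
		}
		// Previously: if status[b] == statusUnavailable { continue }
		// That skip caused orphaned buckets and has been removed.
		if err := deleteBucket(b); err != nil {
			fmt.Printf("delete on %s failed, will retry: %v\n", b, err)
			continue
		}
		deleted = append(deleted, b)
	}
	return deleted
}

func main() {
	status := map[string]backendStatus{
		"ceph-a": statusReady,
		"ceph-b": statusUnavailable, // unreachable, but bucket may exist here
	}
	out := deleteFromBackends([]string{"ceph-a", "ceph-b", "ceph-c"}, status,
		func(backend string) error { return nil })
	fmt.Println(out) // ceph-b is attempted despite being "Unavailable"
}
```

Under this sketch, "ceph-c" (absent from the Status) is skipped, while "ceph-b" is attempted even though it is marked unavailable; a failed attempt is simply retried on the next reconcile rather than silently dropped.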

I have:

  • Run make ready-for-review to ensure this PR is ready for review.
  • Run make ceph-chainsaw to validate these changes against Ceph. This step is not always necessary. However, for changes related to S3 calls it is sensible to validate against an actual Ceph cluster. Localstack is used in our CI Chainsaw suite for convenience and there can be disparity in S3 behaviours between it and Ceph. See docs/TESTING.md for information on how to run tests against a Ceph cluster.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

@nolancon nolancon force-pushed the deletion-remove-skips branch from a64da7f to 1afd535 on November 24, 2025 11:07
@nolancon nolancon changed the title Remove skips/checks from delete flow Remove check for unavailable buckets from delete flow Nov 24, 2025
@nolancon nolancon requested a review from Copilot November 24, 2025 11:15
Copilot AI left a comment
Pull request overview

This PR fixes a critical issue where S3 buckets could be orphaned on backends during deletion when those backends were temporarily unreachable. The deletion logic previously skipped backends marked as "unavailable," which meant buckets on temporarily unreachable backends wouldn't be deleted, leading to orphaned resources.

Key Changes:

  • Removed the check that skipped deletion attempts on "unavailable" backends, allowing deletion to proceed regardless of backend availability status
  • Added comprehensive e2e test coverage for the scenario of disabling/deleting buckets when backends are unhealthy

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Files changed:

  • internal/controller/bucket/delete.go: Removed the condition that skipped deletion on unavailable backends, ensuring all backends are attempted during deletion.
  • e2e/tests/stable/chainsaw-test.yaml: Added test cases verifying bucket disablement and deletion behavior when backends are unhealthy and then recover.


@nolancon nolancon marked this pull request as ready for review November 24, 2025 11:18
@nolancon nolancon merged commit 11cb232 into main Nov 26, 2025
10 checks passed
@nolancon nolancon deleted the deletion-remove-skips branch November 26, 2025 09:53