Cleanup scans unrelated directories and might clean unmanaged data. #5294

@majin1102

Description

We have a scenario that stores big video/audio files outside the format (there's a url field pointing at them) but under the dataset root. The cleanup procedure scans this data and might delete it unexpectedly:

```rust
let unreferenced_paths = self
    .dataset
    .object_store
    .read_dir_all(
        &self.dataset.base,
        inspection.earliest_retained_manifest_time,
    )
    .try_filter_map(|obj_meta| {
        // If a file is new-ish then it might be part of an ongoing operation and so we only
        // delete it if we can verify it is part of an old version.
        let maybe_in_progress = !self.policy.delete_unverified
            && obj_meta.last_modified >= verification_threshold;
        let path_to_remove =
            self.path_if_not_referenced(obj_meta.location, maybe_in_progress, &inspection);
        if matches!(path_to_remove, Ok(Some(..))) {
            removal_stats.lock().unwrap().bytes_removed += obj_meta.size;
        }
        future::ready(path_to_remove)
    })
    .boxed();
```

The default behavior (delete_unverified set to false) is to only clean files that are verified as expired/old. But even when this parameter is set to true, I don't think it should be expected to delete data that is unmanaged by the table format.
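To make the problem concrete, here is a hedged distillation of the decision the snippet above makes (the function name `eligible_for_deletion` is hypothetical, not part of the actual codebase). Note that an unmanaged file under the dataset root is never "referenced" by any manifest, so once it is older than the verification threshold it becomes eligible for deletion even with delete_unverified = false:

```rust
use std::time::SystemTime;

/// Hypothetical distillation of the cleanup decision: a file is only deleted
/// when it is unreferenced AND it is not possibly part of an in-progress
/// operation. `maybe_in_progress` mirrors the check in the snippet above.
fn eligible_for_deletion(
    referenced: bool,
    delete_unverified: bool,
    last_modified: SystemTime,
    verification_threshold: SystemTime,
) -> bool {
    // A new-ish file might belong to an ongoing commit; keep it unless the
    // policy explicitly allows deleting unverified files.
    let maybe_in_progress = !delete_unverified && last_modified >= verification_threshold;
    !referenced && !maybe_in_progress
}
```

With this logic, an old external video file (referenced = false, last_modified far in the past) is deleted regardless of the delete_unverified setting, which is the surprising behavior this issue describes.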

The other issue is that we now have branches stored under the tree directory. I think we should exclude this directory from cleanup scanning, because each branch would have its own cleanup.

So I suggest restricting the scanning scope to the directories the format manages, including _version, data, _index, and _transaction, for both safety and performance.
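The suggested scoping could be sketched as a simple allowlist on the first path component relative to the dataset root (the directory names below follow this issue's wording; the actual on-disk names may differ, and `in_cleanup_scope` is a hypothetical helper, not existing API):

```rust
/// Directories assumed to be managed by the table format, per this issue's
/// suggestion. The real layout may use different names.
const MANAGED_DIRS: [&str; 4] = ["_version", "data", "_index", "_transaction"];

/// Returns true if `relative_path` (relative to the dataset root) lives under
/// one of the managed directories and is therefore safe to scan for cleanup.
fn in_cleanup_scope(relative_path: &str) -> bool {
    relative_path
        .split('/')
        .next()
        .map(|top_level| MANAGED_DIRS.contains(&top_level))
        .unwrap_or(false)
}
```

An allowlist like this would automatically skip both user data stored under the dataset root (e.g. a videos/ directory) and the tree directory holding branches, since neither appears in the managed set.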
