We have a scenario that stores large video/audio files outside the table format (they are referenced through a URL field) but under the dataset root. The cleanup procedure scans these files and might delete them unexpectedly:
```rust
let unreferenced_paths = self
    .dataset
    .object_store
    .read_dir_all(
        &self.dataset.base,
        inspection.earliest_retained_manifest_time,
    )
    .try_filter_map(|obj_meta| {
        // If a file is new-ish then it might be part of an ongoing operation and so we only
        // delete it if we can verify it is part of an old version.
        let maybe_in_progress = !self.policy.delete_unverified
            && obj_meta.last_modified >= verification_threshold;
        let path_to_remove =
            self.path_if_not_referenced(obj_meta.location, maybe_in_progress, &inspection);
        if matches!(path_to_remove, Ok(Some(..))) {
            removal_stats.lock().unwrap().bytes_removed += obj_meta.size;
        }
        future::ready(path_to_remove)
    })
    .boxed();
```
The default behavior is to clean only files that are verified as expired/old, which corresponds to `delete_unverified` being `false`. But I don't think this parameter should be expected to clean data that is not managed by the table format when it is set to `true`.
The other issue is that branches are now stored under the `tree` directory. I think we should exclude this directory from the cleanup scan, because each branch would have its own cleanup.
So I suggest restricting the scanning scope to the directories `_version`, `data`, `_index`, and `_transaction`, for both safety and performance.
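To make the suggestion concrete, here is a minimal sketch of an allowlist check on paths relative to the dataset root. It is not existing Lance code: `CLEANUP_SCOPE_DIRS` and `is_in_cleanup_scope` are hypothetical names, and the directory names are copied from the list in this issue rather than from the codebase.

```rust
/// Top-level directories the cleanup scan would be allowed to visit.
/// Hypothetical constant; the names are taken from the proposal in this issue.
const CLEANUP_SCOPE_DIRS: [&str; 4] = ["_version", "data", "_index", "_transaction"];

/// Returns true if `relative_path` (relative to the dataset root, '/'-separated)
/// points at an object inside one of the allowlisted top-level directories.
/// Hypothetical helper, not part of the Lance API.
fn is_in_cleanup_scope(relative_path: &str) -> bool {
    let mut segments = relative_path.trim_start_matches('/').split('/');
    match (segments.next(), segments.next()) {
        // Require a top-level directory plus at least one segment below it.
        (Some(top), Some(_)) => CLEANUP_SCOPE_DIRS.contains(&top),
        // A bare name directly under the root is never in scope.
        _ => false,
    }
}
```

With a check like this applied before `path_if_not_referenced`, a path such as `videos/clip.mp4` or anything under `tree/` would be skipped regardless of `delete_unverified`. Listing only the allowlisted directories instead of the whole root would additionally avoid scanning the unmanaged files at all.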
The quoted cleanup code is from lance/rust/lance/src/dataset/cleanup.rs, lines 261 to 280 at commit 737c394.