feat: simple auto cleanup #3572
Finally ready for review @wjones127 - better late than never? :)
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3572 +/- ##
==========================================
+ Coverage 78.38% 78.40% +0.01%
==========================================
Files 261 261
Lines 99344 99507 +163
Branches 99344 99507 +163
==========================================
+ Hits 77871 78014 +143
- Misses 18356 18375 +19
- Partials 3117 3118 +1
wjones127
left a comment
This is great work. Have a few small tweaks, and then I think this will be ready.
This will be extra helpful once this is merged: #3572
8d9fd1a to
8fc711f
@wjones127 I've rebased on main and removed those logging tests. Ready for review.
    #[derive(Debug, Clone)]
    pub struct AutoCleanupParams {
        pub interval: usize,
        pub older_than: usize,
My one remaining concern is whether this is the right unit. Days seems a little coarse.
One option is to use a finer time unit, like minutes or hours.
But another nice option might be to take a standard duration format, such as 3d 2h 5m. ISO 8601 has a standard for this, but it's a little ugly (P3Y6M4DT12H30M5S). Even just accepting one number and one unit seems like a welcome improvement, though, and I think I prefer that over a fixed unit, since it makes the configuration more readable.
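To make the "number plus unit" idea concrete, here is a minimal std-only sketch of a parser for strings like 3d 2h 5m. The function name `parse_simple_duration` and the supported unit set (s, m, h, d) are assumptions for illustration only; this is not an existing crate's API and not necessarily what this PR should adopt.

```rust
use std::time::Duration;

/// Parse whitespace-separated `<number><unit>` tokens, e.g. "3d 2h 5m".
/// Hypothetical helper: the unit set (s, m, h, d) is an assumption of this sketch.
fn parse_simple_duration(input: &str) -> Result<Duration, String> {
    let mut total_secs: u64 = 0;
    let mut seen_token = false;
    for token in input.split_whitespace() {
        // Split the token at the first non-digit character.
        let split_at = token
            .find(|c: char| !c.is_ascii_digit())
            .ok_or_else(|| format!("missing unit in '{token}'"))?;
        let (number, unit) = token.split_at(split_at);
        let value: u64 = number
            .parse()
            .map_err(|_| format!("invalid number in '{token}'"))?;
        let secs_per_unit: u64 = match unit {
            "s" => 1,
            "m" => 60,
            "h" => 3_600,
            "d" => 86_400,
            other => return Err(format!("unknown unit '{other}'")),
        };
        // Accumulate with overflow checks so absurd inputs fail cleanly.
        total_secs = value
            .checked_mul(secs_per_unit)
            .and_then(|secs| total_secs.checked_add(secs))
            .ok_or_else(|| format!("duration overflow in '{token}'"))?;
        seen_token = true;
    }
    if seen_token {
        Ok(Duration::from_secs(total_secs))
    } else {
        Err("empty duration".to_string())
    }
}
```

A single token such as 12h also parses, which covers the "one number and one unit" case; a bare number without a unit is rejected rather than silently assumed to be days.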
(Hopefully) resolved by 205f611. I added humantime::parse_duration for human-friendly duration parsing. While this adds an additional dependency to Cargo.toml, it turns out that Cargo.lock contains this dependency as part of object_store (see here). I've expanded out the test case to include a custom duration of "1month 2days 2h 42min 6sec" and it seems to work fine.
In WriteParams, I've updated the older_than argument to be of type chrono::TimeDelta (as opposed to usize). humantime::parse_duration works on std::time::Duration, not chrono::TimeDelta, but I chose chrono::TimeDelta because this then aligns with the type signature in cleanup_old_versions, at the cost of an additional conversion from chrono::TimeDelta to std::time::Duration.
8fc711f to
205f611
    { id = "RUSTSEC-2021-0153", reason = "`encoding` is used by lindera" },
    { id = "RUSTSEC-2024-0384", reason = "`instant` is used by tantivy" },
    { id = "RUSTSEC-2024-0436", reason = "`paste` is used by datafusion" },
    { id = "RUSTSEC-2025-0014", reason = "`humantime` is used by object_store" },
Recent speculation around the maintenance status of humantime led to [RUSTSEC-2025-0014](https://rustsec.org/advisories/RUSTSEC-2025-0014). This has since been withdrawn, so the entry can be removed from `deny.toml`.
6ba14dc to
17b7223
@wjones127 anything else needed to get this over the line?
wjones127
left a comment
Nice work here! Sorry for the delay in my final review.
    }

    impl Default for AutoCleanupParams {
        fn default() -> Self {
@dsgibbons Does it mean the auto cleanup takes effect by default? Is there a switch for this?
@yanghua That's right. I do wonder whether enabling destructive actions should be the default. You can set auto_cleanup: None in your WriteParams if you want to disable this behaviour. A couple of the existing unit tests were altered in this way by this PR to disable auto cleanup.
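For anyone reading along, opting out might look roughly like the sketch below. The two structs are simplified stand-ins written for this comment, not the real definitions: the actual `WriteParams` has many more fields, and `older_than` is a `chrono::TimeDelta` in the PR, whereas `std::time::Duration` is used here only to keep the sketch dependency-free. The helper `write_params_without_cleanup` is hypothetical.

```rust
use std::time::Duration;

// Simplified stand-in for the PR's `AutoCleanupParams` (assumption: std Duration
// instead of chrono::TimeDelta, to avoid external crates in this sketch).
#[derive(Debug, Clone, PartialEq)]
struct AutoCleanupParams {
    interval: usize,
    older_than: Duration,
}

impl Default for AutoCleanupParams {
    fn default() -> Self {
        // Defaults as stated in the PR description: every 20 versions, 14 days.
        Self {
            interval: 20,
            older_than: Duration::from_secs(14 * 24 * 60 * 60),
        }
    }
}

// Simplified stand-in for `WriteParams`; only the field discussed here is shown.
#[derive(Debug, Clone)]
struct WriteParams {
    auto_cleanup: Option<AutoCleanupParams>,
}

impl Default for WriteParams {
    fn default() -> Self {
        // Behaviour described in this PR: cleanup is on unless the caller opts out.
        Self {
            auto_cleanup: Some(AutoCleanupParams::default()),
        }
    }
}

/// Hypothetical helper: start from the defaults but disable auto cleanup.
fn write_params_without_cleanup() -> WriteParams {
    WriteParams {
        auto_cleanup: None,
        ..Default::default()
    }
}
```

Because the field is an `Option`, switching the default to `None` later would only change the `Default` impl; callers that set the field explicitly would be unaffected.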
Can we set None as the default value? Users may not know this default behavior and may not expect it.
Closes #2100.
This PR introduces:
- `auto_cleanup_hook`. If config keys `lance.auto_cleanup.interval` and `lance.auto_cleanup.older_than` are both set, then, every `n_versions % lance.auto_cleanup.interval`, `cleanup_old_versions` will automatically be called.
- `auto_cleanup: Option<AutoCleanupParams>`. If `Some`, new datasets are configured with the necessary `lance.auto_cleanup` config keys. By default this sets the `interval=20` and `older_than=14`.

This PR is Rust-only. I can add Python bindings in a future PR if desired.
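The trigger condition for the hook can be sketched as follows. This is an illustration under assumptions, not the PR's actual code: the function name `cleanup_due` is hypothetical, the config is modelled as a plain `HashMap<String, String>`, and "every `n_versions % interval`" is read as "when the version number is a multiple of the interval".

```rust
use std::collections::HashMap;

// Config key names taken from the PR description.
const INTERVAL_KEY: &str = "lance.auto_cleanup.interval";
const OLDER_THAN_KEY: &str = "lance.auto_cleanup.older_than";

/// Hypothetical helper: returns the raw `older_than` setting when cleanup
/// should run for `version`, or `None` when it should be skipped.
fn cleanup_due<'a>(config: &'a HashMap<String, String>, version: u64) -> Option<&'a str> {
    // Both keys must be present; a missing or unparsable interval disables cleanup.
    let interval: u64 = config.get(INTERVAL_KEY)?.parse().ok()?;
    let older_than = config.get(OLDER_THAN_KEY)?;
    // Assumed reading of the description: run on every `interval`-th version.
    // The zero check avoids a division-by-zero panic on a bad config value.
    if interval != 0 && version % interval == 0 {
        Some(older_than.as_str())
    } else {
        None
    }
}
```

In this sketch a caller would pass the returned `older_than` string on to whatever parses it and then invokes `cleanup_old_versions`; that wiring is left out because it depends on the dataset internals.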
From #2100:
AutoCleanupParams- let me know if you want this documented elsewhere)