Skip to content

feat: simple auto cleanup#3572

Merged
wjones127 merged 20 commits intolance-format:mainfrom
dsgibbons:feat/simple-auto-cleanup-v2
Apr 16, 2025
Merged

feat: simple auto cleanup#3572
wjones127 merged 20 commits intolance-format:mainfrom
dsgibbons:feat/simple-auto-cleanup-v2

Conversation

@dsgibbons
Copy link
Copy Markdown
Contributor

@dsgibbons dsgibbons commented Mar 20, 2025

Closes #2100.

This PR introduces:

  • A hook called auto_cleanup_hook. If config keys lance.auto_cleanup.interval and lance.auto_cleanup.older_than are both set, then, every n_versions % lance.auto_cleanup.interval, cleanup_old_versions will automatically be called.
  • A new write param auto_cleanup: Option<AutoCleanupParams>. If Some, new datasets are configured with the necessary lance.auto_cleanup config keys. By default this sets the interval=20 and older_than=14.
  • A test to ensure hook is automatically invoked. Some older tests have been fixed as they either do not assume autoclaving will take place, or assume the initial default config is empty.

This PR is Rust-only. I can add Python bindings in a future PR if desired.

From #2100:

  • Add configuration key for "auto_cleanup.interval", which is number of versions to have between running.
  • Add a configuration key for "auto_cleanup.older_than", which is the older_than parameter for cleanup_old_versions.
  • Double check documented default value of two weeks. Maybe it should be seven days. (two weeks seems reasonable to me?)
  • Add hook to do this in the commit path.
  • Document how to manually call cleanup_old_versions (not done yet - where would you like this documenting?)
  • Document how to configure auto cleanup and how to disable it. (done in the doc comment for AutoCleanupParams - let me know if you want this documented elsewhere)

@github-actions github-actions Bot added the enhancement New feature or request label Mar 20, 2025
@dsgibbons
Copy link
Copy Markdown
Contributor Author

Finally ready for review @wjones127 - better late than never? :)

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 20, 2025

Codecov Report

Attention: Patch coverage is 86.30952% with 23 lines in your changes missing coverage. Please review.

Project coverage is 78.40%. Comparing base (be688a9) to head (17b7223).
Report is 245 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance/src/dataset/cleanup.rs 86.88% 15 Missing and 1 partial ⚠️
rust/lance/src/dataset/write/insert.rs 83.33% 5 Missing ⚠️
rust/lance/src/io/commit.rs 71.42% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3572      +/-   ##
==========================================
+ Coverage   78.38%   78.40%   +0.01%     
==========================================
  Files         261      261              
  Lines       99344    99507     +163     
  Branches    99344    99507     +163     
==========================================
+ Hits        77871    78014     +143     
- Misses      18356    18375      +19     
- Partials     3117     3118       +1     
Flag Coverage Δ
unittests 78.40% <86.30%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great work. Have a few small tweaks, and then I think this will be ready.

Comment thread rust/lance/src/io/commit.rs Outdated
Comment thread rust/lance/src/io/commit.rs Outdated
wjones127 added a commit that referenced this pull request Mar 24, 2025
This will be extra helpful once this is merged:
#3572
@dsgibbons dsgibbons force-pushed the feat/simple-auto-cleanup-v2 branch 4 times, most recently from 8d9fd1a to 8fc711f Compare March 28, 2025 22:17
@dsgibbons
Copy link
Copy Markdown
Contributor Author

@wjones127 I've rebased on main and removed those logging tests. Ready for review.

Comment thread rust/lance/src/dataset/write.rs Outdated
#[derive(Debug, Clone)]
pub struct AutoCleanupParams {
pub interval: usize,
pub older_than: usize,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My one remaining concern is whether this is the right unit. Days seems a little coarse.

One option is to user a finer time unit, like minutes or hours.

But another nice option might be to take a standard duration format, such as 3d 2h 5m. 8601 has a standard for this, but it's a little ugly (P3Y6M4DT12H30M5S). Even just accepting one number and one unit seems like a welcome improvement, though, and I think I prefer that over a fixed unit, since it makes the configuration more readable.

Copy link
Copy Markdown
Contributor Author

@dsgibbons dsgibbons Apr 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Hopefully) resolved by 205f611. I added humantime::parse_duration for human-friendly duration parsing. While this adds an additional dependency to Cargo.toml, it turns out that Cargo.lock contains this dependency as part of object_store (see here). I've expanded out the test case to include a custom duration of "1month 2days 2h 42min 6sec" and it seems to work fine.

In WriteParams, I've updated the older_than argument to be of type chrono::TimeDelta (as opposed to usize). humantime::parse_duration works on std::time::Duration, not chrono::TimeDelta, but I chose chrono::TimeDelta because this then aligns with the type signature in cleanup_old_versions, at the cost of an additional conversion from chromo::TimeDelta to std::time::Duration.

@dsgibbons dsgibbons force-pushed the feat/simple-auto-cleanup-v2 branch from 8fc711f to 205f611 Compare April 7, 2025 11:44
@github-actions github-actions Bot added the python label Apr 7, 2025
Comment thread deny.toml
{ id = "RUSTSEC-2021-0153", reason = "`encoding` is used by lindera" },
{ id = "RUSTSEC-2024-0384", reason = "`instant` is used by tantivy" },
{ id = "RUSTSEC-2024-0436", reason = "`paste` is used by datafusion" },
{ id = "RUSTSEC-2025-0014", reason = "`humantime` is used by object_store" },
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See RUSTSEC-2025-0014 for reference.

Comment thread rust/lance/src/dataset/cleanup.rs Outdated
@dsgibbons dsgibbons force-pushed the feat/simple-auto-cleanup-v2 branch from 6ba14dc to 17b7223 Compare April 9, 2025 04:37
@dsgibbons
Copy link
Copy Markdown
Contributor Author

@wjones127 anything else needed to get this over the line?

Copy link
Copy Markdown
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work here! Sorry for the delay in my final review.

@wjones127 wjones127 merged commit c1b8069 into lance-format:main Apr 16, 2025
26 of 27 checks passed
}

impl Default for AutoCleanupParams {
fn default() -> Self {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsgibbons Does it mean the auto cleanup takes effect by default? Is there a switch for this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanghua That's right. I do wonder whether enabling destructive actions should be the default. You can set auto_cleanup: None in your WriteParams if you want to disable this behaviour. A couple of the existing unit tests were altered in this way by this PR to disable auto cleanup.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we set None as the default value? Users may not know this default behavior and may not expect it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. Thoughts @wjones127?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wjones127 Any thoughts?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanghua let's discuss this in a dedicated issue, so the discussion is easier to find. Most people don't go looking in comment on closed PRs.

I've started a discussion here:
#3854

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Simple auto cleanup

4 participants