
feat(cleanup): support rate limiter for cleanup operation#6084

Merged
Xuanwo merged 12 commits into lance-format:main from zhangyue19921010:clean_rate_limit
Mar 9, 2026

Conversation

@zhangyue19921010 (Contributor) commented Mar 3, 2026

Closes #3291

    stats = dataset.cleanup_old_versions(
        older_than=(datetime.now() - moment), delete_rate_limit=100.0
    )

github-actions Bot added labels: enhancement (New feature or request), python, java (Mar 3, 2026)

github-actions Bot commented Mar 3, 2026

PR Review: feat(cleanup): support rate limiter for cleanup operation

P1: Panic with very small rate values

In rust/lance/src/dataset/cleanup.rs, the rate-limit application computes:

let duration = Duration::from_secs_f64(1.0 / rate);

The builder validation rejects non-positive and non-finite values, but a very small positive finite rate (e.g. 1e-20) passes validation; then 1.0 / 1e-20 = 1e20 exceeds Duration::MAX (~1.8e19 seconds), causing a panic inside Duration::from_secs_f64.

Even more directly: rate = 1e-300 gives 1.0 / rate = 1e300 seconds, which also panics inside Duration::from_secs_f64.

Suggested fix: Add a lower-bound check in the builder validation, or use Duration::try_from_secs_f64 (available since Rust 1.66) which returns a Result instead of panicking. For example:

let duration = Duration::try_from_secs_f64(1.0 / rate)
    .map_err(|e| Error::Cleanup {
        message: format!("delete_rate_limit {} is too small: {}", rate, e),
    })?;

P1: Test timing sensitivity

The Rust, Python, and Java tests all assert wall-clock elapsed time >= 2 seconds. Under CI load, timing-based assertions can be flaky. The Rust test in particular uses MockClock for version timestamps but relies on real std::time::Instant for the rate-limit assertion. Consider either:

  • Loosening the assertion (e.g. assert >= 1 second instead of 2), or
  • Documenting that these tests may need #[ignore] if CI proves flaky.

This is a minor concern — just flagging for awareness.


Overall the approach is sound — using IntervalStream.zip() to throttle stream emission into remove_stream is clean and effective. The API surface across Rust/Python/Java is consistent, and the lance.auto_cleanup.delete_rate_limit config integration is a nice addition.
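The zip-based throttle can be sketched outside of Rust. Below is a simplified, synchronous Python analogue of the idea; it is not the PR's actual code, which zips a tokio IntervalStream against the delete stream, and the function name is made up for illustration:

```python
import time

def rate_limited(items, ops_per_sec):
    """Yield items no faster than ops_per_sec per second.

    Each item must wait for the next "tick", mirroring how zipping a
    stream with an interval stream paces emission.
    """
    interval = 1.0 / ops_per_sec
    next_allowed = time.monotonic()
    for item in items:
        now = time.monotonic()
        if now < next_allowed:
            time.sleep(next_allowed - now)  # wait for the next tick
        next_allowed = time.monotonic() + interval
        yield item

start = time.monotonic()
consumed = list(rate_limited(range(10), ops_per_sec=100.0))
elapsed = time.monotonic() - start
# 10 items at 100 ops/s: 9 enforced gaps of ~10 ms each, so >= ~90 ms.
```

The first item is emitted immediately; only subsequent items pay the interval, which matches the "throttle emission, don't delay startup" behavior of zipping against an interval stream.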


codecov Bot commented Mar 4, 2026

Codecov Report

❌ Patch coverage is 77.02703% with 17 lines in your changes missing coverage. Please review.

Files with missing lines            Patch %   Lines
rust/lance/src/dataset/cleanup.rs   77.02%    16 Missing, 1 partial ⚠️


Comment thread on rust/lance/src/dataset/cleanup.rs (Outdated)
/// Maximum number of delete operations per second. If None, no rate limiting is applied.
///
/// Use this to avoid hitting S3 (or other object store) request rate limits during cleanup.
/// For example, `Some(100.0)` limits deletions to 100 files per second.
Collaborator:

Our delete uses batch delete, which is 1k files per op, so a limit of 100 here actually means 100k files per second.
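The arithmetic behind this comment, as a quick sketch (the 1k batch size refers to S3's DeleteObjects limit of 1,000 keys per request):

```python
# delete_rate_limit caps delete *operations*, not files; each op is a
# batch delete of up to 1,000 objects (the S3 DeleteObjects limit).
ops_per_sec = 100
files_per_op = 1_000
files_per_sec = ops_per_sec * files_per_op
print(files_per_sec)  # 100000
```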

Comment thread on rust/lance/src/dataset/cleanup.rs (Outdated)
let duration = Duration::try_from_secs_f64(1.0 / rate).map_err(|e| Error::Cleanup {
    message: format!("delete_rate_limit {} is too small: {}", rate, e),
})?;
Collaborator:

Setting the rate to a value that is too small does not seem logical to me. We can instead set it to the smallest meaningful value. For example, we could set the rate's minimum to 1, which would mean one operation per second.
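A sketch of the suggested lower bound, in Python for illustration; MIN_RATE and the function name are hypothetical, not the PR's API:

```python
import math

MIN_RATE = 1.0  # hypothetical floor: one batch-delete op per second

def validate_delete_rate_limit(rate: float) -> float:
    """Reject non-finite or sub-minimum rates up front, instead of
    letting 1.0 / rate overflow the duration later."""
    if not math.isfinite(rate) or rate < MIN_RATE:
        raise ValueError(
            f"delete_rate_limit must be a finite number >= {MIN_RATE}, got {rate}"
        )
    return rate
```

With this check, validate_delete_rate_limit(100.0) passes, while a pathological value such as 1e-300 is rejected at build time rather than panicking mid-cleanup.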

Comment thread on rust/lance/src/dataset/cleanup.rs (Outdated)
/// # Errors
///
/// Returns an error if `rate` is not a positive finite number.
pub fn delete_rate_limit(mut self, rate: f64) -> Result<Self> {
Collaborator:

Do we really need to make this a float?

Comment thread on python/python/tests/test_dataset.py (Outdated)
Comment on lines +1534 to +1538
time.sleep(1)
lance.write_dataset(table, base_dir, mode="overwrite")
time.sleep(1)
lance.write_dataset(table, base_dir, mode="overwrite")
time.sleep(1)
Member:

Are these sleeps needed? Each write will get a unique timestamp. You can fetch the timestamps with a get_versions call. This will help avoid slow test times.

@zhangyue19921010 (Contributor, Author):

Hi @Xuanwo and @westonpace, thanks a lot for your review!
All comments are addressed. I also tested on S3, deleting 3K files during cleanup. Results are as follows.

Scenario       Condition       Elapsed Time (ms)
Unthrottled    No rate limit   3,422
Rate-limited   1 op/s          5,302

@Xuanwo (Collaborator) left a comment:

Thank you!

@Xuanwo Xuanwo merged commit aac74b4 into lance-format:main Mar 9, 2026
29 checks passed

Labels

enhancement (New feature or request), java, python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tune cleanup_old_versions request rate to S3 rate limits

3 participants