
feat: move rate limiting to the object store #6293

Merged
westonpace merged 5 commits into lance-format:main from westonpace:feat/throttle-in-obj-store
Mar 27, 2026

Conversation

@westonpace
Member

Closes #6239

@westonpace westonpace marked this pull request as draft March 25, 2026 13:56
@github-actions github-actions Bot added the enhancement New feature or request label Mar 25, 2026
@westonpace
Member Author

Keeping in draft until #6266 merges

@github-actions
Contributor

PR Review

Well-designed change that moves rate limiting to the right layer. The AIMD algorithm, per-category throttles, and anti-thundering-herd token bucket are solid. A few concerns:

P0: Breaking change without deprecation

Removing LANCE_PROCESS_IO_THREADS_LIMIT is a silent breaking change for users who have configured it. Consider:

  • Logging a warning at startup if LANCE_PROCESS_IO_THREADS_LIMIT is still set, telling users to migrate to the new LANCE_AIMD_* env vars
  • Or keeping it as a no-op with a deprecation warning for one release cycle
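
The first suggestion could be sketched roughly as follows. The env var names come from this PR, but the helper itself is a hypothetical illustration, not the PR's actual code:

```rust
use std::env;

// Hypothetical sketch of the startup warning suggested above. Only the
// environment variable names are taken from this PR.
fn deprecated_env_warning(legacy_value: Option<&str>) -> Option<String> {
    legacy_value.map(|value| {
        format!(
            "LANCE_PROCESS_IO_THREADS_LIMIT={value} is no longer honored; \
             configure the LANCE_AIMD_* variables instead"
        )
    })
}

fn main() {
    // Emit the migration warning once at startup if the removed
    // variable is still set in the environment.
    let legacy = env::var("LANCE_PROCESS_IO_THREADS_LIMIT").ok();
    if let Some(warning) = deprecated_env_warning(legacy.as_deref()) {
        eprintln!("warning: {warning}");
    }
}
```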

P1: is_throttle_error string matching is fragile

The heuristic of matching "retries, max_retries" in error messages is acknowledged as crude, but deserves more attention:

  • If object_store changes its error format, throttle detection silently breaks and AIMD never decreases rate — the worst failure mode
  • Consider also logging/counting unrecognized Generic errors that don't match the pattern, so operators can notice if the heuristic stops matching
  • Worth adding a code comment with the object_store version this was tested against so future maintainers know when to re-verify
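
The heuristic plus the suggested miss counter could look something like this sketch (the substring is the one discussed above; the struct and counter are illustrative, not the PR's code):

```rust
// Illustrative model of the fragile throttle-detection heuristic: plain
// substring matching on error messages, plus a counter so operators can
// notice if the pattern stops matching after an object_store upgrade.
struct ThrottleDetector {
    unrecognized_generic_errors: u64,
}

impl ThrottleDetector {
    fn is_throttle_error(&mut self, generic_error_message: &str) -> bool {
        // object_store's retry layer embeds this phrase when it gives up
        // after exhausting retries (as of the version reviewed here).
        if generic_error_message.contains("retries, max_retries") {
            true
        } else {
            // Count misses; a sudden spike suggests the format drifted.
            self.unrecognized_generic_errors += 1;
            false
        }
    }
}

fn main() {
    let mut detector = ThrottleDetector { unrecognized_generic_errors: 0 };
    assert!(detector.is_throttle_error("Generic error: after 3 retries, max_retries: 3"));
    assert!(!detector.is_throttle_error("Generic error: connection reset"));
    assert_eq!(detector.unrecognized_generic_errors, 1);
}
```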

P1: ThrottledMultipartUpload::put_part — token acquired after future is created

fn put_part(&mut self, data: PutPayload) -> UploadPart {
    let fut = self.target.put_part(data);  // called eagerly
    Box::pin(async move {
        write.acquire_token().await;       // throttle happens later
        let result = fut.await;
        ...
    })
}

self.target.put_part(data) is called before the token is acquired. If an underlying store implementation starts I/O eagerly inside put_part (rather than deferring to the returned future), the throttle is bypassed. This is likely fine for current object_store implementations but is subtle — a brief comment explaining the assumption would help.
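
The ordering concern can be shown with a synchronous stand-in, where "creating the future" is modeled as running the store call either eagerly (before the token) or deferred (after it). All names here are illustrative:

```rust
use std::cell::RefCell;

// Sync stand-in for the async ordering concern above.
fn put_part_order(eager: bool) -> Vec<&'static str> {
    let log = RefCell::new(Vec::new());
    let start_io = || log.borrow_mut().push("io");         // target.put_part
    let acquire_token = || log.borrow_mut().push("token"); // acquire_token().await

    if eager {
        // Mirrors the snippet: the store call runs at construction time,
        // so any eager I/O inside it bypasses the throttle.
        start_io();
        acquire_token();
    } else {
        // Moving the store call inside the async block keeps all I/O
        // behind the throttle.
        acquire_token();
        start_io();
    }
    log.into_inner()
}

fn main() {
    assert_eq!(put_part_order(true), ["io", "token"]); // throttle bypassed
    assert_eq!(put_part_order(false), ["token", "io"]);
}
```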

Minor nits (non-blocking)

  • AimdController uses std::sync::Mutex with .unwrap() — a poisoned mutex will panic. Fine in practice but parking_lot::Mutex avoids this class of issue.
  • observe_outcome silently drops rate updates via try_lock when contended. The doc comment explains this, which is good — just noting that under sustained throttling with high contention, the rate decrease could be delayed.
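
Both nits can be seen in a minimal AIMD controller sketch, assuming illustrative constants and field names (this is not the PR's implementation):

```rust
use std::sync::Mutex;

// Minimal additive-increase / multiplicative-decrease controller.
struct AimdController {
    rate: Mutex<f64>, // allowed requests per second
}

impl AimdController {
    fn new(initial: f64) -> Self {
        Self { rate: Mutex::new(initial) }
    }

    fn observe_outcome(&self, throttled: bool) {
        // try_lock on the hot path: a contended update is silently
        // dropped, so a rate decrease can be delayed under load.
        if let Ok(mut rate) = self.rate.try_lock() {
            if throttled {
                *rate = (*rate * 0.5).max(1.0); // multiplicative decrease
            } else {
                *rate += 1.0; // additive increase
            }
        }
    }

    fn current_rate(&self) -> f64 {
        // unwrap(): panics if another thread panicked while holding
        // the lock (the poisoning concern noted above).
        *self.rate.lock().unwrap()
    }
}

fn main() {
    let controller = AimdController::new(100.0);
    controller.observe_outcome(false); // 101.0
    controller.observe_outcome(true);  // 50.5
    assert_eq!(controller.current_rate(), 50.5);
}
```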

Overall this is a solid improvement over the old semaphore-based approach. The per-store scoping, per-category isolation, and AIMD adaptiveness are all well-motivated by the linked issue.

@codecov

codecov Bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 61.56584% with 108 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
rust/lance-io/src/object_store/throttle.rs | 60.58% | 91 Missing and 17 partials ⚠️


@westonpace westonpace force-pushed the feat/throttle-in-obj-store branch from ec5d3f2 to ce0e646 Compare March 25, 2026 14:59
@westonpace westonpace marked this pull request as ready for review March 25, 2026 15:00
westonpace and others added 2 commits March 26, 2026 16:30
…onfig

The AIMD throttled store now retries throttle errors up to 3 times with
random 100-300ms backoff, so the underlying cloud object store's built-in
retry count is lowered from 10 to 3 to avoid redundant retries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
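
The 100-300 ms randomized backoff from this commit message could be sketched as follows; the clock-based jitter is a dependency-free stand-in for whatever RNG the real implementation uses:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Sketch of the AIMD-level retry backoff: a pseudo-random delay in
// [100, 300] ms, with sub-second clock nanos standing in for an RNG.
fn jittered_backoff_ms() -> u64 {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before Unix epoch")
        .subsec_nanos() as u64;
    100 + nanos % 201 // uniform-ish over [100, 300]
}

fn main() {
    for attempt in 1..=3 {
        println!("AIMD retry {attempt}: back off {} ms", jittered_backoff_ms());
    }
}
```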
@westonpace westonpace force-pushed the feat/throttle-in-obj-store branch from ce0e646 to 0c98572 Compare March 27, 2026 01:50
storage_options: &StorageOptions,
is_s3_express: bool,
) -> Result<Arc<dyn OSObjectStore>> {
let max_retries = storage_options.client_max_retries();
Collaborator

client_max_retries is not useful anymore; do we need to remove it?

Member Author

It should still be used. We actually rely on object_store's retries to make the AIMD throttle work. The object_store errors have no "is_temporary" flag, so we have no other way of knowing whether an error is temporary except to see whether object_store applied retries to it.

So by default we should do 3 object store retries now. Each time those fail we apply an AIMD retry (longer backoff, cut throttle). So in total we get 9 retries instead of 10.
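
The retry budget being described can be sketched as a nested loop, assuming illustrative names and the worst case where every attempt fails:

```rust
// Each AIMD-level retry wraps a full round of object_store client
// retries, so in the worst case the store sees inner * outer requests
// (3 * 3 = 9 with the defaults discussed here).
fn total_store_requests(client_retries: u32, aimd_retries: u32) -> u32 {
    let mut requests = 0;
    for _aimd_attempt in 0..aimd_retries {
        for _client_attempt in 0..client_retries {
            requests += 1; // one request reaches the object store
            // (a success here would return from both loops)
        }
        // On failure: AIMD backoff and rate cut, then the next round.
    }
    requests
}

fn main() {
    assert_eq!(total_store_requests(3, 3), 9);
}
```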

Collaborator

It seems the multipart upload process is not covered. Is that expected?

Member Author

Good catch. I've added multipart to the retry handling. I still can't apply the outer retry loop to the delete stream or list stream methods. These return a stream of items, and there is no way to map results back to underlying object store requests, so there is nothing to retry there. Since these operations are hopefully rare, I think it will be ok.

@@ -354,7 +461,7 @@ impl AimdThrottledStore {
impl ObjectStore for AimdThrottledStore {
Collaborator

I think we should consider overriding rename_if_not_exists with our own handling.

Currently, the logic retries the entire copy and delete process. If the copy succeeds but the delete fails due to throttling, the entire operation is retried. After that, the copy will fail because the object already exists.
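
This failure mode can be reproduced with a toy in-memory store; the types and behavior here are an illustrative model, not object_store's API:

```rust
use std::collections::HashSet;

// Toy model: rename_if_not_exists = copy_if_not_exists + delete.
#[derive(Debug, PartialEq)]
enum StoreError { AlreadyExists, Throttled }

struct Store {
    objects: HashSet<String>,
    fail_next_delete: bool, // simulate one throttled delete
}

impl Store {
    fn copy_if_not_exists(&mut self, from: &str, to: &str) -> Result<(), StoreError> {
        if self.objects.contains(to) {
            return Err(StoreError::AlreadyExists);
        }
        if self.objects.contains(from) {
            self.objects.insert(to.to_string());
        }
        Ok(())
    }

    fn delete(&mut self, path: &str) -> Result<(), StoreError> {
        if self.fail_next_delete {
            self.fail_next_delete = false;
            return Err(StoreError::Throttled);
        }
        self.objects.remove(path);
        Ok(())
    }

    fn rename_if_not_exists(&mut self, from: &str, to: &str) -> Result<(), StoreError> {
        self.copy_if_not_exists(from, to)?;
        self.delete(from)
    }
}

fn main() {
    let mut store = Store {
        objects: std::iter::once("a".to_string()).collect(),
        fail_next_delete: true,
    };
    // First attempt: the copy succeeds, then the delete is throttled.
    assert_eq!(store.rename_if_not_exists("a", "b"), Err(StoreError::Throttled));
    // Naively retrying the whole pair now fails on the copy, because
    // the destination already exists from the first attempt.
    assert_eq!(store.rename_if_not_exists("a", "b"), Err(StoreError::AlreadyExists));
}
```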

Member Author

I'll create a follow-up, I don't think we use rename_if_not_exists in most cases anymore.

westonpace and others added 2 commits March 27, 2026 05:01
- Change client_max_retries default from 10 to 3 and restore
  configurability via OBJECT_STORE_CLIENT_MAX_RETRIES in cloud providers
- Add LANCE_AIMD_MAX_RETRIES, LANCE_AIMD_MIN_BACKOFF_MS,
  LANCE_AIMD_MAX_BACKOFF_MS env vars (and storage option equivalents)
  to configure AIMD throttle retry behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add `is_disabled()` to `AimdThrottleConfig` so that setting
`lance_aimd_max_retries=0` bypasses the throttle layer entirely.
Also warn and skip the layer when `client_max_retries=0`, since the
AIMD implementation relies on the object store client surfacing retry
errors to detect throttling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
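
The bypass logic this commit message describes could be sketched as follows; the struct body and the should_wrap_store helper are illustrative, not the PR's code:

```rust
// Sketch of the two disable conditions described above.
struct AimdThrottleConfig {
    max_retries: u32,
}

impl AimdThrottleConfig {
    fn is_disabled(&self) -> bool {
        // lance_aimd_max_retries=0 means "never retry at the AIMD layer",
        // so the throttle wrapper is skipped entirely.
        self.max_retries == 0
    }
}

fn should_wrap_store(aimd: &AimdThrottleConfig, client_max_retries: u32) -> bool {
    // The AIMD layer detects throttling by watching the client's retry
    // errors, so it is also pointless when client_max_retries == 0.
    !aimd.is_disabled() && client_max_retries > 0
}

fn main() {
    assert!(should_wrap_store(&AimdThrottleConfig { max_retries: 3 }, 3));
    assert!(!should_wrap_store(&AimdThrottleConfig { max_retries: 0 }, 3));
    assert!(!should_wrap_store(&AimdThrottleConfig { max_retries: 3 }, 0));
}
```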
@westonpace
Member Author

westonpace commented Mar 27, 2026

Test results:

Test Description

On Azure, 40 VMs (8 cores each) hammer the same blob with random takes, each take grabbing 1024 rows. Each VM has 8 processes so there are a total of 320 processes taking from the same blob. This blob is in a standard class storage account. This should aggressively trigger rate limiting errors. Everything runs for 5 minutes.

With AIMD disabled, previous settings (note: the red line indicates success in this chart, for some reason)

8,330 failed takes, 50,176 rows taken

(chart omitted)

With AIMD enabled, default settings

0 Errors, 9,951,232 rows taken

(chart omitted)

Aggressive AIMD (1 object_store retry, 5 AIMD retries)

0 Errors, 7,606,272 rows taken

(chart omitted)

(will be updated with additional results)

- Retry put_part, complete, and abort in ThrottledMultipartUpload using
  LANCE_AIMD_MAX_RETRIES. put_part uses a std::sync::Mutex (not tokio)
  to share the inner upload across retry futures without risking
  deadlock from aborted futures. complete/abort use Arc::get_mut since
  &mut self guarantees exclusive access.
- Update client_max_retries docs: default 10 → 3, s3 client → object
  store client. Same wording fix for client_retry_timeout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collaborator

@Xuanwo Xuanwo left a comment

Thank you!

@westonpace westonpace merged commit 209f99b into lance-format:main Mar 27, 2026
32 checks passed

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

Move rate limiting from ScanScheduler into the object store layer

2 participants