feat: move rate limiting to the object store #6293

westonpace merged 5 commits into lance-format:main
Conversation
Keeping in draft until #6266 merges
PR Review

Well-designed change that moves rate limiting to the right layer. The AIMD algorithm, per-category throttles, and anti-thundering-herd token bucket are solid. A few concerns:

P0: Breaking change without deprecation. Removing
P1:
ec5d3f2 to ce0e646
…onfig The AIMD throttled store now retries throttle errors up to 3 times with random 100-300ms backoff, so the underlying cloud object store's built-in retry count is lowered from 10 to 3 to avoid redundant retries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
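The random 100-300 ms backoff mentioned in this commit can be sketched as follows. This is an illustrative stand-in only: `jittered_backoff` and the inline LCG are assumptions for the sketch, not the actual lance implementation (which presumably uses a proper RNG).

```rust
use std::time::Duration;

// Illustrative sketch: derive a random backoff in [100, 300] ms.
// The LCG (Knuth's constants) keeps the example dependency-free.
fn jittered_backoff(seed: &mut u64) -> Duration {
    *seed = seed
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    let jitter_ms = 100 + (*seed >> 33) % 201; // 100..=300 ms
    Duration::from_millis(jitter_ms)
}
```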
ce0e646 to 0c98572
```rust
    storage_options: &StorageOptions,
    is_s3_express: bool,
) -> Result<Arc<dyn OSObjectStore>> {
    let max_retries = storage_options.client_max_retries();
```
`client_max_retries` is not useful anymore; do we need to remove it?
It should still be used. We actually rely on object_store's retries to make the AIMD throttle work. The object_store errors carry no "is_temporary" flag, so the only way to tell whether an error is temporary is to see whether object_store applied retries to it.

So by default we now do 3 object store retries. Each time those fail we apply an AIMD retry (longer backoff, cut throttle). So in total we get 9 retries instead of 10.
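The layered retry scheme described above can be sketched roughly as follows. The type and method names (`AimdThrottle`, `execute`, `cut`) are hypothetical, not the actual lance API:

```rust
use std::time::Duration;

// Hypothetical sketch of the outer AIMD retry loop. Each call to `op`
// stands for a request whose inner object_store retries (3 by default)
// are already exhausted; the outer loop then cuts the throttle
// (multiplicative decrease), backs off, and tries again, giving
// 3 inner x 3 outer = 9 retries in total.
struct AimdThrottle {
    rate: f64,             // current allowed request rate
    aimd_max_retries: u32, // outer retry budget (LANCE_AIMD_MAX_RETRIES)
}

impl AimdThrottle {
    // Multiplicative decrease: halve the rate on a throttle error.
    fn cut(&mut self) {
        self.rate /= 2.0;
    }

    fn execute<T, E>(&mut self, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
        let mut attempt = 0;
        loop {
            match op() {
                Ok(v) => return Ok(v),
                Err(_) if attempt < self.aimd_max_retries => {
                    attempt += 1;
                    self.cut();
                    // The real store sleeps a random 100-300 ms here.
                    std::thread::sleep(Duration::from_millis(1));
                }
                Err(e) => return Err(e),
            }
        }
    }
}
```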
It seems the multipart upload process is not covered. Is that expected?
Good catch. I've added multipart to the retry handling. I still can't apply the outer retry loop to the delete stream or list stream methods: they return a stream of items, and there is no way to map results back to underlying object store requests, so there is nothing to retry there. Since these operations are hopefully rare, I think that will be OK.
```diff
@@ -354,7 +461,7 @@ impl AimdThrottledStore {
 impl ObjectStore for AimdThrottledStore {
```
I think we should consider overriding our own rename_if_not_exists handling.
Currently, the logic retries the entire copy and delete process. If the copy succeeds but the delete fails due to throttling, the entire operation is retried. After that, the copy will fail because the object already exists.
I'll create a follow-up; I don't think we use `rename_if_not_exists` in most cases anymore.
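The hazard described in this thread can be demonstrated with a toy in-memory store. `FakeStore` and its methods are invented for illustration, not lance code; the point is that naively retrying the whole copy+delete sequence fails permanently once the first copy has succeeded:

```rust
use std::collections::HashSet;

// Toy illustration of the rename_if_not_exists retry hazard.
struct FakeStore {
    objects: HashSet<String>,
    delete_failures_left: u32, // simulate throttled deletes
}

impl FakeStore {
    fn copy_if_not_exists(&mut self, _from: &str, to: &str) -> Result<(), String> {
        if self.objects.contains(to) {
            return Err(format!("{to} already exists"));
        }
        self.objects.insert(to.to_string());
        Ok(())
    }

    fn delete(&mut self, path: &str) -> Result<(), String> {
        if self.delete_failures_left > 0 {
            self.delete_failures_left -= 1;
            return Err("throttled".to_string());
        }
        self.objects.remove(path);
        Ok(())
    }

    // Naive retry: re-runs the copy as well as the delete. After the
    // first copy succeeds and the delete is throttled, every later
    // attempt fails at the copy because the target already exists.
    fn rename_if_not_exists(&mut self, from: &str, to: &str) -> Result<(), String> {
        for _ in 0..3 {
            if self
                .copy_if_not_exists(from, to)
                .and_then(|_| self.delete(from))
                .is_ok()
            {
                return Ok(());
            }
        }
        Err("rename failed".to_string())
    }
}
```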
- Change client_max_retries default from 10 to 3 and restore configurability via OBJECT_STORE_CLIENT_MAX_RETRIES in cloud providers
- Add LANCE_AIMD_MAX_RETRIES, LANCE_AIMD_MIN_BACKOFF_MS, LANCE_AIMD_MAX_BACKOFF_MS env vars (and storage option equivalents) to configure AIMD throttle retry behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add `is_disabled()` to `AimdThrottleConfig` so that setting `lance_aimd_max_retries=0` bypasses the throttle layer entirely. Also warn and skip the layer when `client_max_retries=0`, since the AIMD implementation relies on the object store client surfacing retry errors to detect throttling. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
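The bypass logic described in this commit might look roughly like the following; `AimdThrottleConfig` and `is_disabled` match the names in the commit message, while `should_wrap` and the field names are assumptions for the sketch:

```rust
// Hypothetical sketch of the two AIMD bypass conditions.
struct AimdThrottleConfig {
    aimd_max_retries: u32, // from lance_aimd_max_retries / LANCE_AIMD_MAX_RETRIES
}

impl AimdThrottleConfig {
    fn is_disabled(&self) -> bool {
        self.aimd_max_retries == 0
    }
}

fn should_wrap(config: &AimdThrottleConfig, client_max_retries: u32) -> bool {
    if config.is_disabled() {
        return false; // user explicitly turned the throttle layer off
    }
    if client_max_retries == 0 {
        // AIMD detects throttling by seeing the client exhaust its own
        // retries, so with no client retries the layer cannot work.
        eprintln!("warning: client_max_retries=0, skipping AIMD throttle layer");
        return false;
    }
    true
}
```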
- Retry put_part, complete, and abort in ThrottledMultipartUpload using LANCE_AIMD_MAX_RETRIES. put_part uses a std::sync::Mutex (not tokio) to share the inner upload across retry futures without risking deadlock from aborted futures. complete/abort use Arc::get_mut since &mut self guarantees exclusive access.
- Update client_max_retries docs: default 10 → 3, s3 client → object store client. Same wording fix for client_retry_timeout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Closes #6239