feat: support dynamic storage options provider with AWS credentials vending (#4905)

jackye1995 merged 4 commits into lance-format:main
Conversation
Codecov Report

❌ Patch coverage is

```diff
@@            Coverage Diff            @@
##             main    #4905     +/-  ##
=========================================
- Coverage   81.77%   81.73%   -0.05%
=========================================
  Files         340      341       +1
  Lines      140102   140593     +491
  Branches   140102   140593     +491
=========================================
+ Hits       114568   114907     +339
- Misses      21729    21867     +138
- Partials     3805     3819      +14
```

Flags with carried forward coverage won't be shown.
Force-pushed from 4fedbd1 to 0169ede
wjones127
left a comment
Got stuck trying to understand the basics of how this works.
> ```rust
> /// and expires_at_millis is the epoch time in milliseconds when credentials expire
> async fn get_credentials(&self) -> Result<(HashMap<String, String>, u64)>;
> ```

Could we use a proper time type here, like `std::time::Instant`?

Suggested change:

```rust
/// and expires_at_millis is the epoch time in milliseconds when credentials expire
async fn get_credentials(&self) -> Result<(HashMap<String, String>, Instant)>;
```
> ```rust
> /// How early to refresh credentials before expiration (in milliseconds)
> /// Default: 300,000 (5 minutes)
> pub refresh_lead_time_ms: u64,
> ```

Similarly, can we use a proper type here like `std::time::Duration`?

Suggested change:

```rust
/// How early to refresh credentials before expiration (in milliseconds)
/// Default: 300,000 (5 minutes)
pub refresh_lead_time_ms: Duration,
```
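The `Duration`-based config the review suggests can be sketched std-only; `RefreshConfig` and `refresh_at` are illustrative names, not the PR's actual API:

```rust
use std::time::{Duration, Instant};

/// Hypothetical config sketch using Duration instead of raw milliseconds,
/// as the review suggests. Names here are stand-ins, not the PR's API.
pub struct RefreshConfig {
    /// How early to refresh credentials before expiration.
    pub refresh_lead_time: Duration,
}

impl Default for RefreshConfig {
    fn default() -> Self {
        Self {
            refresh_lead_time: Duration::from_secs(300), // 5 minutes
        }
    }
}

impl RefreshConfig {
    /// The instant at which a refresh should be triggered for credentials
    /// that expire at `expires_at`.
    pub fn refresh_at(&self, expires_at: Instant) -> Instant {
        expires_at - self.refresh_lead_time
    }
}
```

With a typed `Duration`, callers can no longer confuse milliseconds with seconds, and defaults read directly as time values.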
> ```rust
> pub struct DelegatingObjectStore {
>     wrapper: Arc<CredentialVendingObjectStoreWrapper>,
>     inner: Arc<dyn OSObjectStore>,
> }
> ```

How are the credentials passed to the object store?
I've been looking through this code, and I haven't been able to figure out how the credentials are passed to the actual object store instance. I see them being refreshed to the side, but I don't see how the object store is supposed to pick up the new credentials.
What I would expect is that somewhere you build a CredentialProvider and then pass it into ObjectStore::with_credentials() when constructing the object store. For example, that's how we have credential refresh for AWS working right now:
https://github.com/lancedb/lance/blob/d342d5db2da90edcbcf4bd88aa38985bcb111aa8/rust/lance-io/src/object_store/providers/aws.rs#L58-L64
https://github.com/lancedb/lance/blob/d342d5db2da90edcbcf4bd88aa38985bcb111aa8/rust/lance-io/src/object_store/providers/aws.rs#L80-L82
Force-pushed from d254ba9 to 4443b04
Actually it might be better to call this feature more generically, like
Force-pushed from 52cf400 to f44ee32
Force-pushed from 1a26702 to 28e26e8
…lder (#5045) I ended up doing these in #4984 and #4905, so I decided to pull it out and get it cleaned up first. This PR moves the directory namespace from using OpenDAL directly to using the Lance ObjectStore. This avoids the inconsistency between the dir namespace and the underlying Lance table storage configurations. Users can still use OpenDAL, and if we fully migrate Lance to OpenDAL it will be applied to both layers at the same time as well. The PR also improves the namespace builders with a builder style and allows supplying a Lance session. Since we have not published a stable version yet, we do not care about backwards compatibility. This PR also ensures the lance-namespace-impls features are consistent with lance-io features. Related to #5042
Force-pushed from 7ab0258 to c3ffec3
Force-pushed from d7631ae to fda3dda
💡 Codex Review
https://github.com/lancedb/lance/blob/e1f2a2cea032d1e2412f0ba5e32400d47d5dd633/rust/lance-io/src/object_store.rs#L224-L279
Exclude storage_options_provider from ObjectStoreParams cache key
The new dynamic credential provider is carried on ObjectStoreParams.storage_options_provider, but the Hash/PartialEq implementations used by the object‑store registry still ignore this field. Two datasets that point to the same bucket but use different StorageOptionsProvider implementations will hash to the same key and be treated as equal, so the registry can return a cached ObjectStore initialized with the wrong provider. This means credential refreshes (and even initial credentials) for dataset A can be reused for dataset B, leading to authentication failures or credential leakage across tables. The provider pointer should be included in Hash/eq to ensure object store caching respects different credential sources.
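Codex's suggestion of including the provider pointer in `Hash`/`PartialEq` can be sketched std-only; `Provider` and `Params` below are stand-ins for the PR's actual types:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

/// Stand-in for the StorageOptionsProvider trait object.
trait Provider: Send + Sync {}
struct A;
impl Provider for A {}

/// Stand-in for ObjectStoreParams: cache key must distinguish providers.
struct Params {
    bucket: String,
    provider: Option<Arc<dyn Provider>>,
}

impl Hash for Params {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.bucket.hash(state);
        // Hash the provider by pointer identity: params sharing the same
        // Arc'd provider hash alike; distinct providers hash differently.
        if let Some(p) = &self.provider {
            (Arc::as_ptr(p) as *const () as usize).hash(state);
        }
    }
}

impl PartialEq for Params {
    fn eq(&self, other: &Self) -> bool {
        self.bucket == other.bucket
            && match (&self.provider, &other.provider) {
                (Some(a), Some(b)) => Arc::ptr_eq(a, b),
                (None, None) => true,
                _ => false,
            }
    }
}
```

Pointer identity is the cheapest option when trait objects carry no stable ID; the PR's `provider_id()` approach achieves the same goal with a user-controlled key instead.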
Good point. Added the Rust one locally, pending merging the Python and Java ones.
Force-pushed from 2ca5848 to 05fd868
Xuanwo
left a comment
Thank you for working on this!
wjones127
left a comment
Is there a specific use case for the dynamic storage options beyond credentials? If not, I think I'd prefer to make this just a credential provider.
> ```rust
> /// Trait for providing storage options with expiration tracking
> ///
> /// Implementations can fetch storage options from various sources (namespace servers,
> /// secret managers, etc.) and are usable from Python/Java.
> ///
> /// # Equality and Hashing
> ///
> /// Implementations must provide `provider_id()` which returns a unique identifier for
> /// equality and hashing purposes. Two providers with the same ID are considered equal
> /// and will share the same cached ObjectStore in the registry.
> #[async_trait]
> pub trait StorageOptionsProvider: Send + Sync + fmt::Debug {
> ```

Is there a reason to make this generic storage options? If not, I think we should narrow it to just be credentials.
In the near term, probably not.
Thinking about what's possible with similar features in Iceberg, there are things that could be enabled, like:
- bucket aliasing, where the user can say "for this bucket name, temporarily swap in another one"
- S3 tags, where the user can configure dynamic tags to be assigned to the files written by specific sessions
Basically everything that is controllable in the Iceberg FileIO can be controlled dynamically this way, so I was thinking about making equivalent changes in Lance for ObjectStore.
But I don't have a strong opinion; it's probably fine to just do the features case by case.
Okay I'm more open to this. The tags idea seems cool. I also could see a future where you might want the catalog to tell you which KMS keys to use for an S3 bucket. 👍
> ```rust
> // Case 4: provider + credentials without expiration - FAIL
> (None, Some(_), _) => Err(Error::IO {
>     source: Box::new(std::io::Error::other(
>         "expires_at_millis is required when using storage_options_provider with credentials",
>     )),
>     location: location!(),
> }),
> ```

I think we can trust that if they pass a credential provider they already handled expiration, right?
Yes, this is more for completeness.
> ```rust
> pub struct DynamicStorageOptionsCredentialProvider {
>     provider: Arc<dyn StorageOptionsProvider>,
>     cache: Arc<RwLock<Option<CachedCredential>>>,
>     refresh_offset: Duration,
> }
> ```

If I wanted to be fancy, I would do something like:

```rust
pub struct DynamicCredentials(pub Arc<HashMap<String, String>>);

pub struct NamespaceCredentialsProvider {
    provider: Arc<dyn StorageOptionsProvider>,
    cache: Arc<RwLock<Option<CachedCredential>>>,
    refresh_offset: Duration,
}

impl CredentialProvider for NamespaceCredentialsProvider<AwsCredentials> {
    type Credential = AwsCredentials;
    async fn get_credential(&self) -> ObjectStoreResult<Arc<Self::Credential>> {
        self.fetch_credential::<Self::Credential>().await
    }
}

impl TryFrom<DynamicCredentials> for AwsCredentials {}
impl TryFrom<DynamicCredentials> for AzureCredentials {}
impl TryFrom<DynamicCredentials> for GcpCredentials {}

impl<T: TryFrom<DynamicCredentials>> NamespaceCredentialsProvider {
    async fn fetch_credential<T>(&self) -> Result<T> {
        let credential = todo!();
        credential.try_into()
    }
}
```

Then you don't need to repeat this adapter for all three clouds. The only thing that's cloud-specific is how to map from DynamicCredentials to the particular credentials type.
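The `TryFrom` mapping the reviewer sketches can be made concrete std-only; the struct shapes and option key names below are assumptions for illustration, not Lance's actual API:

```rust
use std::collections::HashMap;

/// Stand-in for the reviewer's DynamicCredentials wrapper: raw storage
/// options as returned by a provider.
struct DynamicCredentials(HashMap<String, String>);

/// Illustrative AWS credential shape; the real object_store type differs.
#[derive(Debug, PartialEq)]
struct AwsCredential {
    key_id: String,
    secret_key: String,
    token: Option<String>,
}

impl TryFrom<&DynamicCredentials> for AwsCredential {
    type Error = String;
    fn try_from(c: &DynamicCredentials) -> Result<Self, Self::Error> {
        // Assumed option keys; real keys depend on the provider contract.
        let get = |k: &str| c.0.get(k).cloned().ok_or_else(|| format!("missing {k}"));
        Ok(AwsCredential {
            key_id: get("aws_access_key_id")?,
            secret_key: get("aws_secret_access_key")?,
            token: c.0.get("aws_session_token").cloned(),
        })
    }
}
```

With one `TryFrom` impl per cloud, a single generic fetch-and-convert path can serve AWS, Azure, and GCP, which is the deduplication the comment is after.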
Yeah, I actually originally implemented it that way. But I tried it on GCP and it did not really work out, because GCP does not really provide remote temporary credentials; the generation of the temporary token and the signing are all done locally. I need to look into it more, so I ended up just doing it for AWS for now. Let me know if you'd prefer to set it up this way first.
Ah okay. That's fine if it's hard to do. Can also be a future refactor.
ObjectStore has this TokenProvider internal trait they use. Not sure if you saw that already:
> ```rust
> storage_options_provider: Option<Arc<dyn StorageOptionsProvider>>,
> expires_at_millis: Option<u64>,
> ```

- It seems like `storage_options_provider` and `credentials` should be mutually exclusive? Either I want the credentials to be controlled by `storage_options_provider`, or I want to pass my own custom `AwsCredentialProvider`.
- I'm not sure why we need `expires_at_millis`. My understanding is that the final `AwsCredentialProvider::get_credentials()` is called on every request, so the expiration could be handled internally, right?
This is actually intentionally not mutually exclusive. What I am trying to achieve here is that these fields are the initial values. When we first load a dataset, we already call namespace.describeTable once, which gives the table location plus storage options containing the initial credentials and expiration time. I am leveraging the existing fields to pass those in, so I don't need a second, duplicated namespace.describeTable call just to fetch a fresh set of credentials and expiration time. Once the initial usage is done, these fields are no longer respected.
Ah okay. Could you explain that in a comment?
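The initial-value flow described above can be sketched std-only; the function and parameter names are illustrative, not the PR's actual API:

```rust
use std::collections::HashMap;

/// Sketch of the flow described above: storage options from the first
/// describeTable call seed the cache, and the provider is only consulted
/// once they near expiration. All names here are stand-ins.
fn effective_options(
    initial: Option<(HashMap<String, String>, u64)>,
    fetch_fresh: &dyn Fn() -> (HashMap<String, String>, u64),
    now_millis: u64,
    lead_ms: u64,
) -> (HashMap<String, String>, u64) {
    match initial {
        // Initial options are still valid well before expiration: reuse
        // them, avoiding a second describeTable round trip.
        Some((opts, exp)) if now_millis + lead_ms < exp => (opts, exp),
        // Absent, expired, or near expiry: ask the provider for fresh ones.
        _ => fetch_fresh(),
    }
}
```

This is why the initial `credentials` and `expires_at_millis` coexist with the provider: they cover the window before the first refresh is needed.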
wjones127
left a comment
Thanks for your responses. I think this is good, but I'd suggest adding some comments that explain the stuff I had questions about.
beinan
left a comment
We're looking forward to this feature, thanks!
Force-pushed from 5f77b4b to 7855872
This PR extends #4905 with 2 features: 1. When writing, the dataset might not exist yet; we add a convenience method `write_into_namespace` which calls `namespace.create_empty_table` to create the empty table, gets its URI and storage options (if any), and uses those to create the table. 2. Pass the storage options provider to: 1. `write_fragments` for distributed writes; 2. the Lance file writer, for callers that write individual Lance files in the same table directory but not in the Lance table. Note: this PR only offers Rust and Python implementations; Java will be added later.
This PR introduces a new dynamic storage options provider interface in the Lance dataset. The main idea is that the provider describes the storage options to use for a given dataset and when those options will expire. Lance is responsible for fetching a new set of storage options to re-initialize the object store when expiration happens.
This is mainly useful for cases where the dataset's access credentials are temporary and the user would like to invoke a specific credentials endpoint to fetch a new set of credentials. Currently we have only added support for AWS by implementing a credentials provider.
The PR also provides an implementation of the `StorageOptionsProvider` with Lance Namespace, because Lance Namespace provides a DescribeTable endpoint which returns the storage options that should be used at the given time. Based on the namespace spec, an `expires_at_millis` key can be added to the storage options to indicate the expiration time of those options.
Because Lance Namespace provides native implementations in Python and Java, we also provide the binding interfaces `PyStorageOptionsProvider` and `JavaStorageOptionsProvider`, which are then further integrated with the Lance Namespace implementations in those specific languages.
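The refresh mechanism the description outlines can be sketched std-only; the trait shape follows the `(options, expires_at_millis)` contract quoted in the review, but method and function names here are stand-ins:

```rust
use std::collections::HashMap;
use std::time::{SystemTime, UNIX_EPOCH};

/// Stand-in for the provider contract: return storage options plus an
/// epoch-millis expiration (the PR's version is async).
trait StorageOptionsProvider {
    fn fetch_storage_options(&self) -> (HashMap<String, String>, u64);
}

/// Current wall-clock time in epoch milliseconds.
fn now_millis() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before epoch")
        .as_millis() as u64
}

/// Refresh when we are within `lead_ms` of expiration, so the object store
/// is re-initialized before the old credentials stop working.
fn needs_refresh(expires_at_millis: u64, lead_ms: u64) -> bool {
    now_millis() + lead_ms >= expires_at_millis
}
```

The lead time is what keeps in-flight requests from racing the expiration: options are swapped while the old credentials are still valid.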