fix: prevent cache collisions in ObjectStoreRegistry#5153
Conversation
ObjectStoreRegistry maintains a cache of ObjectStore objects. Previously, when caching an Azure blob store, the account name was not part of the cache key. This could result in collisions in the case where we tried to cache two Azure blob stores using the same container name, but different account names. Container names are not unique in Azure; only the combination of account name, container name is. The flow here is roughly that we determine what type of object store we're creating by looking at the first part of the URL (the scheme), and then use the associated object store Provider object to generate the appropriate store_prefix from the second part of the URL (the authority). We want to be able to do all this relatively quickly, without instantiating heavyweight objects, to support the cache key use-case. In the specific case of Azure, the account name may not even be in the URL at all. It can be passed as part of the storage options. Therefore, the function generating the object store prefix must also take the storage_options as a parameter. The final object store prefix we generate for Azure must include this account information, as well as the container information. This PR modifies WrappingObjectStore.wrap() to take the store prefix as an argument, rather than the storage options. In general, the storage options are highly specific to the type of the object store, which the wrapper generally doesn't know. It would not make sense to look for Azure-specific storage options when wrapping an s3 store, for example.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5153 +/- ##
========================================
Coverage 81.85% 81.85%
========================================
Files 341 341
Lines 140548 140854 +306
Branches 140548 140854 +306
========================================
+ Hits 115046 115298 +252
- Misses 21694 21743 +49
- Partials 3808 3813 +5
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
wjones127
left a comment
There was a problem hiding this comment.
This looks good. I like the idea of Pushing the responsibility of creating a unique cache key on the provider rather than exposing the storage options to the wrapper store 👍
ObjectStoreRegistry maintains a cache of ObjectStore objects. Previously, when caching an Azure blob store, the account name was not part of the cache key. This could result in collisions in the case where we tried to cache two Azure blob stores using the same container name, but different account names. Container names are not unique in Azure; only the combination of account name, container name is. The flow here is roughly that we determine what type of object store we're creating by looking at the first part of the URL (the scheme), and then use the associated object store Provider object to generate the appropriate store_prefix from the second part of the URL (the authority). We want to be able to do all this relatively quickly, without instantiating heavyweight objects, to support the cache key use-case. In the specific case of Azure, the account name may not even be in the URL at all. It can be passed as part of the storage options. Therefore, the function generating the object store prefix must also take the storage_options as a parameter. The final object store prefix we generate for Azure must include this account information, as well as the container information. This PR modifies WrappingObjectStore.wrap() to take the store prefix as an argument, rather than the storage options. In general, the storage options are highly specific to the type of the object store, which the wrapper generally doesn't know. It would not make sense to look for Azure-specific storage options when wrapping an s3 store, for example.
ObjectStoreRegistry maintains a cache of ObjectStore objects. Previously, when caching an Azure blob store, the account name was not part of the cache key. This could result in collisions in the case where we tried to cache two Azure blob stores using the same container name, but different account names. Container names are not unique in Azure; only the combination of account name, container name is.
The flow here is roughly that we determine what type of object store we're creating by looking at the first part of the URL (the scheme), and then use the associated object store Provider object to generate the appropriate store_prefix from the second part of the URL (the authority). We want to be able to do all this relatively quickly, without instantiating heavyweight objects, to support the cache key use-case.
In the specific case of Azure, the account name may not even be in the URL at all. It can be passed as part of the storage options. Therefore, the function generating the object store prefix must also take the storage_options as a parameter. The final object store prefix we generate for Azure must include this account information, as well as the container information.
This PR modifies WrappingObjectStore.wrap() to take the store prefix as an argument, rather than the storage options. In general, the storage options are highly specific to the type of the object store, which the wrapper generally doesn't know. It would not make sense to look for Azure-specific storage options when wrapping an s3 store, for example.