fix: cache object stores and bucket regions to reduce DNS query volume #3802
Merged
5 commits
- b8fe072 fix: cache object stores and bucket regions to reduce DNS query volume (andygrove)
- 3fe3c01 fmt (andygrove)
- 37c3c81 fix: use from_url_path for cached object store path extraction (andygrove)
- 4f1cc4e docs: document why static lifetime is appropriate for object store ca… (andygrove)
- 121bd5d docs: clarify that process lifetime == application lifetime in K8s (andygrove)
```diff
@@ -35,10 +35,13 @@ use datafusion::execution::object_store::ObjectStoreUrl;
 use datafusion::execution::runtime_env::RuntimeEnv;
 use datafusion::physical_plan::ColumnarValue;
 use datafusion_comet_spark_expr::EvalMode;
+use log::debug;
 use object_store::path::Path;
 use object_store::{parse_url, ObjectStore};
 use std::collections::HashMap;
+use std::sync::OnceLock;
 use std::time::Duration;
+use std::{collections::hash_map::DefaultHasher, hash::Hasher, sync::RwLock};
 use std::{fmt::Debug, hash::Hash, sync::Arc};
 use url::Url;
```
```diff
@@ -444,6 +447,56 @@ fn create_hdfs_object_store(
     })
 }
 
+type ObjectStoreCache = RwLock<HashMap<(String, u64), Arc<dyn ObjectStore>>>;
+
+/// Process-wide cache of object stores, keyed by `(scheme://host:port, config_hash)`.
+///
+/// ## Why static / process lifetime?
+///
+/// Comet's JNI architecture calls `initRecordBatchReader` once per Parquet file, and each
+/// call constructs a fresh `RuntimeEnv`. There is therefore no executor-scoped Rust object
+/// with a lifetime longer than a single file read that could own this cache. The executor
+/// process itself is the natural scope for HTTP connection-pool reuse, so process lifetime
+/// (i.e. `static`) is the appropriate choice here. In the standard Spark-on-Kubernetes
+/// deployment model each executor process is dedicated to a single Spark application, so
+/// process lifetime and application lifetime are equivalent; the cache is reclaimed when
+/// the executor pod terminates.
+///
+/// ## Unbounded size
+///
+/// Cache entries are indexed by `(scheme://host:port, hash-of-configs)`. A typical Spark
+/// job accesses a small, fixed set of buckets with a stable configuration, so the number of
+/// distinct keys is O(buckets × credential-configs) and remains small throughout the job.
+/// Entries are cheap relative to the cost of creating a new object store (new HTTP
+/// connection pool + DNS resolution), and there is no meaningful benefit from eviction, so
+/// no eviction policy is applied.
+///
+/// ## Credential invalidation
+///
+/// Object stores that use dynamic credentials (IMDS, WebIdentity, ECS role, STS assume-role)
+/// delegate credential refresh to a `CometCredentialProvider` that fetches fresh credentials
+/// on every request, so credential rotation is transparent and requires no cache
+/// invalidation. Object stores whose credentials are embedded in the Hadoop configuration
+/// (e.g. `fs.s3a.access.key` / `fs.s3a.secret.key`) produce a different `config_hash` when
+/// those values change, which causes a new store to be created and inserted under the new
+/// key; the old entry is harmlessly superseded.
+fn object_store_cache() -> &'static ObjectStoreCache {
+    static CACHE: OnceLock<ObjectStoreCache> = OnceLock::new();
+    CACHE.get_or_init(|| RwLock::new(HashMap::new()))
+}
+
+/// Compute a hash of the object store configuration for cache keying.
+fn hash_object_store_configs(configs: &HashMap<String, String>) -> u64 {
+    let mut hasher = DefaultHasher::new();
+    let mut keys: Vec<&String> = configs.keys().collect();
+    keys.sort();
+    for key in keys {
+        key.hash(&mut hasher);
+        configs[key].hash(&mut hasher);
+    }
+    hasher.finish()
+}
+
 /// Parses the url, registers the object store with configurations, and returns a tuple of the object store url
 /// and object store path
 pub(crate) fn prepare_object_store_with_configs(
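To illustrate the cache-keying behavior described in the doc comment, here is a self-contained sketch of the same sorted-key hashing. The function body mirrors the diff's `hash_object_store_configs`; the `fs.s3a.*` entries and credential values are made-up examples. It shows that identical config maps hash identically regardless of insertion order, while any changed value (such as a rotated static credential) produces a new cache key.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Same logic as the PR's `hash_object_store_configs`: keys are sorted so the
/// hash is independent of `HashMap` iteration order.
fn hash_object_store_configs(configs: &HashMap<String, String>) -> u64 {
    let mut hasher = DefaultHasher::new();
    let mut keys: Vec<&String> = configs.keys().collect();
    keys.sort();
    for key in keys {
        key.hash(&mut hasher);
        configs[key].hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    let mut a = HashMap::new();
    a.insert("fs.s3a.endpoint".to_string(), "s3.amazonaws.com".to_string());
    a.insert("fs.s3a.access.key".to_string(), "AKID-OLD".to_string());

    // Same entries, inserted in the opposite order: identical hash.
    let mut b = HashMap::new();
    b.insert("fs.s3a.access.key".to_string(), "AKID-OLD".to_string());
    b.insert("fs.s3a.endpoint".to_string(), "s3.amazonaws.com".to_string());
    assert_eq!(hash_object_store_configs(&a), hash_object_store_configs(&b));

    // Rotating an embedded credential changes the hash, so lookups under the
    // old key miss and a new store is created, as the doc comment describes.
    b.insert("fs.s3a.access.key".to_string(), "AKID-NEW".to_string());
    assert_ne!(hash_object_store_configs(&a), hash_object_store_configs(&b));
}
```

Sorting before hashing matters because `HashMap` iteration order is unspecified; without it, two equal maps could produce different hashes and defeat the cache.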
```diff
@@ -467,17 +520,45 @@ pub(crate) fn prepare_object_store_with_configs(
         &url[url::Position::BeforeHost..url::Position::AfterPort],
     );
 
-    let (object_store, object_store_path): (Box<dyn ObjectStore>, Path) = if is_hdfs_scheme {
-        create_hdfs_object_store(&url)
-    } else if scheme == "s3" {
-        objectstore::s3::create_store(&url, object_store_configs, Duration::from_secs(300))
-    } else {
-        parse_url(&url)
-    }
-    .map_err(|e| ExecutionError::GeneralError(e.to_string()))?;
+    let config_hash = hash_object_store_configs(object_store_configs);
+    let cache_key = (url_key.clone(), config_hash);
+
+    // Check the cache first to reuse existing object store instances.
+    // This enables HTTP connection pooling and avoids redundant DNS lookups.
+    let cached = {
+        let cache = object_store_cache()
+            .read()
+            .map_err(|e| ExecutionError::GeneralError(format!("Object store cache error: {e}")))?;
+        cache.get(&cache_key).cloned()
+    };
+
+    let (object_store, object_store_path): (Arc<dyn ObjectStore>, Path) =
+        if let Some(store) = cached {
+            debug!("Reusing cached object store for {url_key}");
+            let path = Path::from_url_path(url.path())
+                .map_err(|e| ExecutionError::GeneralError(e.to_string()))?;
+            (store, path)
+        } else {
+            debug!("Creating new object store for {url_key}");
+            let (store, path): (Box<dyn ObjectStore>, Path) = if is_hdfs_scheme {
+                create_hdfs_object_store(&url)
+            } else if scheme == "s3" {
+                objectstore::s3::create_store(&url, object_store_configs, Duration::from_secs(300))
+            } else {
+                parse_url(&url)
+            }
+            .map_err(|e| ExecutionError::GeneralError(e.to_string()))?;
+
+            let store: Arc<dyn ObjectStore> = Arc::from(store);
+            // Insert into cache
+            if let Ok(mut cache) = object_store_cache().write() {
+                cache.insert(cache_key, Arc::clone(&store));
+            }
+            (store, path)
+        };
 
     let object_store_url = ObjectStoreUrl::parse(url_key.clone())?;
-    runtime_env.register_object_store(&url, Arc::from(object_store));
+    runtime_env.register_object_store(&url, object_store);
     Ok((object_store_url, object_store_path))
 }
```

A review exchange on the `object_store_cache().write()` line:

> Contributor: why the

> Contributor: nvm, it is a cache write, not an object store write
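The read-then-write caching flow in the hunk above can be exercised in a standalone sketch. Everything here is hypothetical scaffolding: a toy `Store` struct stands in for `Arc<dyn ObjectStore>`, and `get_or_create` stands in for the cache-consulting portion of `prepare_object_store_with_configs`. The locking structure mirrors the PR: a `OnceLock`-initialized `RwLock<HashMap>` is consulted under a read lock first, and only a cache miss takes the write lock.

```rust
use std::collections::HashMap;
use std::sync::{Arc, OnceLock, RwLock};

// Toy stand-in for the real `Arc<dyn ObjectStore>` entries; this struct is
// hypothetical and exists only so the sketch compiles on its own.
struct Store {
    url_key: String,
}

type Cache = RwLock<HashMap<(String, u64), Arc<Store>>>;

// Process-wide, lazily initialized cache, mirroring the shape of the PR's
// `object_store_cache()` helper.
fn cache() -> &'static Cache {
    static CACHE: OnceLock<Cache> = OnceLock::new();
    CACHE.get_or_init(|| RwLock::new(HashMap::new()))
}

// Read-lock lookup first; only a miss takes the write lock to insert the
// newly created store. The read guard is dropped before `write()` is called,
// so there is no lock-upgrade deadlock.
fn get_or_create(url_key: &str, config_hash: u64) -> Arc<Store> {
    let key = (url_key.to_string(), config_hash);
    if let Some(store) = cache().read().unwrap().get(&key).cloned() {
        return store; // hit: reuse the instance (and its connection pool)
    }
    let store = Arc::new(Store { url_key: url_key.to_string() });
    cache().write().unwrap().insert(key, Arc::clone(&store));
    store
}

fn main() {
    let a = get_or_create("s3://bucket-a", 42);
    let b = get_or_create("s3://bucket-a", 42);
    assert!(Arc::ptr_eq(&a, &b)); // same key -> same cached instance
    let c = get_or_create("s3://bucket-a", 7);
    assert!(!Arc::ptr_eq(&a, &c)); // different config hash -> distinct store
    assert_eq!(c.url_key, "s3://bucket-a");
}
```

As in the PR, a concurrent miss on the same key is a benign race: two threads may both create a store, the later insert supersedes the earlier one, and correctness is unaffected; only one redundant store is briefly alive.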
Review discussion:

> Anything `static` (a global singleton) should be documented: why is `static` the reasonable life cycle? In particular, I wonder about the unbounded size of a `static` cache, invalidation scenarios (what if a job runs long enough and needs new credentials passed into the object_store?), and why there was no other location with a reasonable life cycle to own this cache.

> Added docs, but I have not yet confirmed that the part about credential provider interaction is actually correct, so moved to draft for now.

> @mbutrovich could you take another look?