[GLUTEN-9182][VL] Support new s3 configuration in Gluten#9183
[GLUTEN-9182][VL] Support new s3 configuration in Gluten#9183marin-ma merged 3 commits intoapache:mainfrom
Conversation
14363ce to
b9babdb
Compare
|
cc @zhouyuan Could you please have a review for this PR? |
zhouyuan
left a comment
There was a problem hiding this comment.
Thanks for adding these configurations!
| * TRACE | ||
| "OFF", "FATAL", "ERROR", "WARN", "INFO", "DEBUG", "TRACE". | ||
|
|
||
| ## Configuring Whether To Use Proxy From Env for S3 C++ Client |
There was a problem hiding this comment.
i understand this is aligning with velox config, but is this a standard aws-sdk-cpp feature? they seems to disable proxy by intention
aws/aws-sdk-cpp#1049
There was a problem hiding this comment.
I believe this is a standard feature, we can see this doc: https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/client-config.html
cpp/velox/utils/ConfigExtractor.cc
Outdated
| {S3Config::Keys::kIamRole, std::make_pair("iam.role", std::nullopt)}, | ||
| {S3Config::Keys::kIamRoleSessionName, std::make_pair("iam.role.session.name", "gluten-session")}, | ||
| {S3Config::Keys::kEndpointRegion, std::make_pair("endpoint.region", std::nullopt)}, | ||
| {S3Config::Keys::kCredentialsProvider, std::make_pair("aws.credentials.provider", std::nullopt)}, |
There was a problem hiding this comment.
I wonder if it's safe to add this mapping in Gluten. If this configuration is not set, Velox uses its own logic to create credential providers that are supported in aws-sdk-cpp.
Say if a workload is configured with spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider and access/secret keys are set, it should go to (code link). But adding this mapping may break that part of logic and go to (code link). Then it will get the error 'CredentialsProviderFactory for 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider' not registered'
@dcoliversun @zhouyuan Do you have any thoughts or suggestions?
There was a problem hiding this comment.
@marin-ma Your concern is valid. Although the SimpleAWSCredentialsProvider is not required when declaring ak/sk, this piece of code might still disrupt some stable workflows. One idea I have is to use a different configuration name to distinguish it, or not set credentials provider configuration when declaring ak/sk.
If we choose second path, I guess it's better to add some codes in velox.
There was a problem hiding this comment.
Can we use the registration to add mappings from AWSCredentialsProvider to the implementation in aws-sdk-cpp?
e.g.
registerAWSCredentialsProvider("org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider", [](const S3Config& config) {
GLUTEN_CHECK(
config.accessKey().has_value() && !config.accessKey().value().empty(),
"Access key cannot be empty for SimpleAWSCredentialsProvider");
GLUTEN_CHECK(
config.secretKey().has_value() && !config.secretKey().value().empty(),
"Secret key cannot be empty for SimpleAWSCredentialsProvider");
return std::make_shared<Aws::Auth::SimpleAWSCredentialsProvider>(
config.accessKey().value(), config.secretKey().value());
});
Then we need to support all valid mappings in Gluten. Perhaps we can remove the configuration from this patch and add it back in another one together with adding support for all valid mappings. cc: @FelixYBW
There was a problem hiding this comment.
Sure, have removed the configuration from this patch. cc @marin-ma
|
Run Gluten Clickhouse CI on x86 |
| ("spark.gluten.sql.columnar.backend.velox.fileHandleCacheEnabled", "false"), | ||
| ("spark.gluten.velox.awsSdkLogLevel", "FATAL") | ||
| ("spark.gluten.velox.awsSdkLogLevel", "FATAL"), | ||
| ("spark.gluten.velox.s3UseProxyFromEnv", "false"), |
There was a problem hiding this comment.
Why set them in backend conf? Looks like session conf is enough.
There was a problem hiding this comment.
Just like sdkLogLevel configuration, will be used in hive connector in velox
|
Run Gluten Clickhouse CI on x86 |
|
@marin-ma Thanks for your review, it's ready to merge 😄 |
|
===== Performance report for TPCDS SF2000 with Velox backend, for reference only ====
|
|
===== Performance report for TPCH SF2000 with Velox backend, for reference only ====
|
What changes were proposed in this pull request?
This PR aims to add the s3a configuration supported by Velox in Gluten.
(Fixes: #9182)
How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)