Conversation

@junyuc25
Contributor

@junyuc25 junyuc25 commented Oct 24, 2023

What changes were proposed in this pull request?

As Spark moves to 4.0, one of the major improvements is upgrading the AWS SDK to v2,
as tracked in the parent Jira: https://issues.apache.org/jira/browse/SPARK-44124.

Currently, some tests in this module (i.e. DepsTestsSuite) use an S3 client that requires
AWS credentials during initialization.

As part of the SDK upgrade, the main purpose of this PR is to upgrade the AWS SDK to v2
for the Kubernetes integration tests module.
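
For reference, a minimal sketch (not the exact test code) of how such an S3 client can be built with the v2 SDK, using the test constants shown further down in this PR; the endpoint URI and object key below are illustrative assumptions:

    import java.net.URI

    import software.amazon.awssdk.auth.credentials.{AwsBasicCredentials, StaticCredentialsProvider}
    import software.amazon.awssdk.core.sync.RequestBody
    import software.amazon.awssdk.regions.Region
    import software.amazon.awssdk.services.s3.S3Client
    import software.amazon.awssdk.services.s3.model.{CreateBucketRequest, PutObjectRequest}

    // Static credentials: the tests talk to a local MinIO service, not AWS itself.
    val credentials = AwsBasicCredentials.create("minio", "miniostorage")

    val s3 = S3Client.builder()
      .credentialsProvider(StaticCredentialsProvider.create(credentials))
      .endpointOverride(URI.create("http://localhost:9000")) // assumed MinIO endpoint
      .region(Region.US_WEST_2)                              // v2 requires a resolvable region
      .forcePathStyle(true)                                  // path-style addressing for MinIO
      .build()

    // Create a bucket and upload an object, the kind of setup the dependency tests rely on.
    s3.createBucket(CreateBucketRequest.builder().bucket("spark").build())
    s3.putObject(
      PutObjectRequest.builder().bucket("spark").key("deps/example.jar").build(),
      RequestBody.fromString("placeholder"))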

Why are the changes needed?

With the GA of AWS SDK v2, SDK v1 has entered maintenance mode, and its future
releases are limited to addressing critical bugs and security issues. More details
about the SDK maintenance policy can be found here: https://docs.aws.amazon.com/sdkref/latest/guide/maint-policy.html.
To keep Spark's dependencies up to date, we should upgrade the SDK to v2.

Does this PR introduce any user-facing change?

No, because this change only impacts the integration test code.

How was this patch tested?

The existing integration tests in the k8s integration test module passed.

Was this patch authored or co-authored using generative AI tooling?

No

@junyuc25 junyuc25 marked this pull request as ready for review October 24, 2023 14:25
@junyuc25 junyuc25 marked this pull request as draft October 27, 2023 08:07
@junyuc25 junyuc25 changed the title [Don't merge or review][WIP] Test k8s changes [Don't merge or review][WIP] Upgrade AWS SDK to v2 for Kubernetes integration tests module Oct 31, 2023
@junyuc25 junyuc25 changed the title [Don't merge or review][WIP] Upgrade AWS SDK to v2 for Kubernetes integration tests module [SPARK-45719][K8S] Upgrade AWS SDK to v2 for Kubernetes integration tests module Oct 31, 2023
@junyuc25 junyuc25 marked this pull request as ready for review October 31, 2023 06:25
Contributor

@steveloughran steveloughran left a comment

Nothing I'm worried about here; it is only minio after all.

pom.xml Outdated
<aws.kinesis.client.version>1.12.0</aws.kinesis.client.version>
<!-- Should be consistent with Kinesis client dependency -->
<aws.java.sdk.version>1.11.655</aws.java.sdk.version>
<aws.java.sdk.v2.version>2.20.128</aws.java.sdk.v2.version>
Contributor

hadoop is @ 2.20.160 already

Contributor Author

Thanks. I've updated the version.

kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.8.0/installer/volcano-development.yaml || true
eval $(minikube docker-env)
build/sbt -Psparkr -Pkubernetes -Pvolcano -Pkubernetes-integration-tests -Dspark.kubernetes.test.driverRequestCores=0.5 -Dspark.kubernetes.test.executorRequestCores=0.2 -Dspark.kubernetes.test.volcanoMaxConcurrencyJobNum=1 -Dtest.exclude.tags=local "kubernetes-integration-tests/test"
build/sbt -Phadoop-3 -Psparkr -Pkubernetes -Pvolcano -Pkubernetes-integration-tests -Dspark.kubernetes.test.driverRequestCores=0.5 -Dspark.kubernetes.test.executorRequestCores=0.2 -Dspark.kubernetes.test.volcanoMaxConcurrencyJobNum=1 -Dtest.exclude.tags=local "kubernetes-integration-tests/test"
Contributor

Why is this change needed for this PR? "hadoop-3" is the default hadoop profile.

Contributor Author

Without explicitly activating this profile, I'm seeing compilation issues during build:

[ERROR] /Users/xxx/repositories/Spark-upgrade-k8s/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala:30: not found: object software
[ERROR] /Users/xxx/repositories/Spark-upgrade-k8s/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala:31: not found: object software
[ERROR] /Users/xxx/repositories/Spark-upgrade-k8s/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala:32: not found: object software
[ERROR] /Users/xxx/repositories/Spark-upgrade-k8s/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala:33: not found: object software
[ERROR] /Users/xxx/repositories/Spark-upgrade-k8s/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala:34: not found: object software
[ERROR] /Users/xxx/repositories/Spark-upgrade-k8s/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala:309: not found: type S3Client
[ERROR] /Users/xxx/repositories/Spark-upgrade-k8s/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala:310: not found: value AwsBasicCredentials
[ERROR] /Users/xxx/repositories/Spark-upgrade-k8s/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala:311: not found: value S3Client
[ERROR] /Users/xxx/repositories/Spark-upgrade-k8s/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala:312: not found: value StaticCredentialsProvider
[ERROR] /Users/xxx/repositories/Spark-upgrade-k8s/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala:314: not found: value Region
[ERROR] /Users/xxx/repositories/Spark-upgrade-k8s/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala:323: not found: value CreateBucketRequest
[ERROR] /Users/xxx/repositories/Spark-upgrade-k8s/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala:342: not found: value PutObjectRequest

In the pom file of this integration test module, the hadoop-3 profile is set to <activeByDefault>true</activeByDefault>. This setting means that when another profile in this pom is activated via -P on the command line (i.e. -Pvolcano in this case), the hadoop-3 profile is deactivated. This behavior is also explained here. Since the volcano profile is explicitly activated, we also need to explicitly activate hadoop-3 to address the above errors.

val BUCKET = "spark"
val ACCESS_KEY = "minio"
val SECRET_KEY = "miniostorage"
val REGION = "us-west-2"
Contributor

How does v1 work without REGION?

Contributor Author

When initializing a v2 S3 client, a region must be specified (via code, environment variables, system properties, etc.).
But when initializing a v1 S3 client, the region is not mandatory. For instance, I was able to list all buckets across regions with the following.

    import com.amazonaws.services.s3.AmazonS3Client

    // v1 client built without an explicit region; credentials come from the default chain
    val s3client = new AmazonS3Client()
    val response1 = s3client.listBuckets()
    println(response1)
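
For contrast, a minimal v2 sketch of the same call (assuming credentials are picked up from the default provider chain): without .region(...) here, or a region supplied through the environment or system properties, building the client fails at runtime.

    import software.amazon.awssdk.regions.Region
    import software.amazon.awssdk.services.s3.S3Client

    // v2 refuses to build the client unless a region can be resolved,
    // unlike the v1 AmazonS3Client above
    val v2client = S3Client.builder()
      .region(Region.US_WEST_2)
      .build()
    println(v2client.listBuckets())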

Contributor

The whole region thing is a real pain... for AWS itself you are now expected to declare the region and have it work everything out; for third-party stores, just the endpoint and any region string should suffice. See HADOOP-18908 for our ongoing struggles there.

@junyuc25
Contributor Author

Hi @dongjoon-hyun, wonder if you could take a look?

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-45719][K8S] Upgrade AWS SDK to v2 for Kubernetes integration tests module [SPARK-45719][K8S][TESTS] Upgrade AWS SDK to v2 for Kubernetes IT Nov 15, 2023
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.
Thank you, @junyuc25 , @steveloughran , @LantaoJin .

Merged to master for Apache Spark 4.0.0.

@dongjoon-hyun
Member

Welcome to the Apache Spark community, @junyuc25 !
I added you to the Apache Spark contributor group and assigned SPARK-45719 to you.

dongjoon-hyun added a commit that referenced this pull request Mar 28, 2024
### What changes were proposed in this pull request?

This PR aims to ban `AWS SDK for Java v1`. We migrated to v2 via the following.
- #45583
- #43510

### Why are the changes needed?

To ensure the migration to AWS SDK for Java v2 because of the end-of-support schedule below. `v2` is strongly recommended, since v1 enters maintenance mode in July 2024.
- https://aws.amazon.com/blogs/developer/announcing-end-of-support-for-aws-sdk-for-java-v1-x-on-december-31-2025/
> AWS SDK for Java v1.x will enter maintenance mode on July 31, 2024, and reach end-of-support on December 31, 2025.

### Does this PR introduce _any_ user-facing change?

No, this PR only prevents mixing in this old dependency in the future.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45759 from dongjoon-hyun/SPARK-47632.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>