[SPARK-45393][BUILD] Upgrade Hadoop to 3.4.0 #45583

dongjoon-hyun · 2024-03-19T03:53:05Z

What changes were proposed in this pull request?

This PR aims to upgrade to Apache Hadoop 3.4.0 for Apache Spark 4.0.0.

Why are the changes needed?

To bring in new features such as the following:

https://hadoop.apache.org/docs/r3.4.0
- HADOOP-18995 Upgrade AWS SDK version to 2.21.33 for S3 Express One Zone
- HADOOP-18996 S3A to provide full support for S3 Express One Zone
- HADOOP-18328 Supports S3 on Outposts

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

No.

dongjoon-hyun · 2024-03-19T03:56:14Z

Hi, @LuciferYang . I tried to search for your previous PR, but couldn't find it. So, I recreated it with your coauthorship here.

If you want to re-open your PR, please let me know, @LuciferYang .

dongjoon-hyun · 2024-03-19T03:57:41Z

BTW, I'll add the following based on the failed module, if needed.

    <dependency>
      <groupId>org.bouncycastle</groupId>
      <artifactId>bcpkix-jdk18on</artifactId>
      <version>${bouncycastle.version}</version>
      <scope>test</scope>
    </dependency>

dongjoon-hyun · 2024-03-19T03:58:56Z

dev/deps/spark-deps-hadoop-3-hive-2.3

https://mvnrepository.com/artifact/software.amazon.awssdk/bundle/2.23.19

What a big jar...:)

.... surprising

LuciferYang · 2024-03-19T05:02:32Z

> Hi, @LuciferYang . I tried to search for your previous PR, but couldn't find it. So, I recreated it with your coauthorship here.
>
> If you want to re-open your PR, please let me know, @LuciferYang .

Thanks @dongjoon-hyun ~ Let's use this one ~

dongjoon-hyun · 2024-03-19T05:23:22Z

Got it. Thank you, @LuciferYang .

LuciferYang

+1, LGTM (pending test)

dongjoon-hyun · 2024-03-20T03:49:10Z

Thank you, @LuciferYang .

For this PR, I fixed the following three so far, but I guess there is one more to go. Let's see the CI result.

dongjoon-hyun · 2024-03-20T17:23:40Z

resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala


-  test("running Spark in yarn-cluster mode displays driver log links") {
+  // TODO(SPARK-47491): Re-enable `driver log links` test in YarnClusterSuite
+  ignore("running Spark in yarn-cluster mode displays driver log links") {


I'll handle this later in SPARK-47491 because a YARN-only test PR doesn't trigger a full GitHub Actions run, which makes it easier to fix there. Currently, this PR triggers the full CI builds, which makes the investigation difficult.

dongjoon-hyun · 2024-03-20T17:37:27Z

At the previous commit (https://github.com/dongjoon-hyun/spark/runs/22865082407), we already passed all tests except one YARN failure. And, the failed test case is ignored here. Let me merge this to move forward.

Thank you, @LuciferYang and @yaooqinn .

### What changes were proposed in this pull request?

This PR aims to ban `AWS SDK for Java v1`. We migrated to v2 via the following.

- #45583
- #43510

### Why are the changes needed?

To ensure the migration to AWS SDK for Java v2, given the following end-of-support schedule. `v2` is strongly recommended since July.

- https://aws.amazon.com/blogs/developer/announcing-end-of-support-for-aws-sdk-for-java-v1-x-on-december-31-2025/

> AWS SDK for Java v1.x will enter maintenance mode on July 31, 2024, and reach end-of-support on December 31, 2025.

### Does this PR introduce _any_ user-facing change?

No, this PR only prevents mixing in this old dependency in the future.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45759 from dongjoon-hyun/SPARK-47632.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

LuciferYang · 2024-07-29T06:58:02Z

Sorry to disturb everyone, but when I execute OrcEncryptionSuite on my M2 Max, I find that there are some differences when using Hadoop 3.4.0 and Hadoop 3.3.4.

build/sbt clean "sql/testOnly org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite"

branch-3.5(with hadoop 3.3.4)

[info] OrcEncryptionSuite:
14:44:11.580 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[info] - Write and read an encrypted file (1 second, 921 milliseconds)
[info] - Write and read an encrypted table (374 milliseconds)
[info] - SPARK-35325: Write and read encrypted nested columns (358 milliseconds)
[info] - SPARK-35992: Write and read fully-encrypted columns with default masking (570 milliseconds)
14:44:15.461 WARN org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite: 

[info] Run completed in 4 seconds, 694 milliseconds.
[info] Total number of tests run: 4
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

master(with hadoop 3.4.0)

[info] OrcEncryptionSuite:
14:49:15.267 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14:49:17.636 WARN org.apache.hadoop.crypto.OpensslCipher: Failed to load OpenSSL Cipher.
java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.util.NativeCodeLoader.buildSupportsOpenssl()'
	at org.apache.hadoop.util.NativeCodeLoader.buildSupportsOpenssl(Native Method)
	at org.apache.hadoop.crypto.OpensslCipher.<clinit>(OpensslCipher.java:86)
	at org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec.<init>(OpensslAesCtrCryptoCodec.java:36)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
[info] - Write and read an encrypted file (2 seconds, 343 milliseconds)
[info] - Write and read an encrypted table (405 milliseconds)
[info] - SPARK-35325: Write and read encrypted nested columns (308 milliseconds)
[info] - SPARK-35992: Write and read fully-encrypted columns with default masking (555 milliseconds)
14:49:19.493 WARN org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite: 

[info] Run completed in 5 seconds, 84 milliseconds.
[info] Total number of tests run: 4
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

When using Hadoop 3.4.0, although there were no test failures, an UnsatisfiedLinkError was thrown. Is this expected, or do I need to configure additional dependencies? This issue should only occur on Apple Silicon chips now. @dongjoon-hyun @steveloughran

dongjoon-hyun · 2024-07-30T04:05:27Z

To @LuciferYang , have you tried other combinations like branch-3.4 + Hadoop 3.4 or master + 3.3.4? It seems that you are reporting only the difference between branches, which could be a result of many other things like an ORC version difference or the installed SSL versions.

> I find that there are some differences when using Hadoop 3.4.0 and Hadoop 3.3.4.

In other words, I'm wondering whether you are reporting the results before and after this commit, @LuciferYang . To support your claim, could you share those results with us?

LuciferYang · 2024-07-30T04:50:33Z

@dongjoon-hyun Test on master:

before this commit:

git reset --hard a34c8ceb19bd1c1548a60bb144d1c587a2861cd8 # [SPARK-47462][SQL] Align mappings of other unsigned numeric types with TINYINT in MySQLDialect
build/sbt clean "sql/testOnly org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite"

result:

[info] OrcEncryptionSuite:
12:47:33.148 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[info] - Write and read an encrypted file (2 seconds, 117 milliseconds)
[info] - Write and read an encrypted table (375 milliseconds)
[info] - SPARK-35325: Write and read encrypted nested columns (275 milliseconds)
[info] - SPARK-35992: Write and read fully-encrypted columns with default masking (517 milliseconds)
12:47:37.058 WARN org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite: 

===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.datasources.orc.OrcEncryptionSuite, threads: ForkJoinPool.commonPool-worker-4 (daemon=true), rpc-boss-3-1 (daemon=true), Thread-17 (daemon=true), ForkJoinPool.commonPool-worker-2 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true), Thread-18 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true) =====
[info] Run completed in 4 seconds, 760 milliseconds.
[info] Total number of tests run: 4
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

after this commit:

git reset --hard 49b4c3bc9c09325de941dfaf41e4fd3a4a4c345f # [SPARK-45393][BUILD] Upgrade Hadoop to 3.4.0
build/sbt clean "sql/testOnly org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite"

result:

[info] OrcEncryptionSuite:
12:42:55.441 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12:42:57.950 WARN org.apache.hadoop.crypto.OpensslCipher: Failed to load OpenSSL Cipher.
java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.util.NativeCodeLoader.buildSupportsOpenssl()'
	at org.apache.hadoop.util.NativeCodeLoader.buildSupportsOpenssl(Native Method)
	at org.apache.hadoop.crypto.OpensslCipher.<clinit>(OpensslCipher.java:86)
	at org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec.<init>(OpensslAesCtrCryptoCodec.java:36)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
[info] - Write and read an encrypted file (2 seconds, 486 milliseconds)
[info] - Write and read an encrypted table (402 milliseconds)
[info] - SPARK-35325: Write and read encrypted nested columns (299 milliseconds)
[info] - SPARK-35992: Write and read fully-encrypted columns with default masking (623 milliseconds)
12:42:59.856 WARN org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite: 

===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.datasources.orc.OrcEncryptionSuite, threads: rpc-boss-3-1 (daemon=true), Thread-17 (daemon=true), ForkJoinPool.commonPool-worker-2 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true), Thread-18 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true) =====
[info] Run completed in 5 seconds, 291 milliseconds.
[info] Total number of tests run: 4
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

screenshot of git log:

dongjoon-hyun · 2024-07-30T04:53:59Z

Ack. Thank you for sharing. Let me take a look at that as an independent JIRA issue because it's a Mac-only issue, @LuciferYang .

dongjoon-hyun · 2024-07-30T04:57:20Z

SPARK-49055 is filed, @LuciferYang .

LuciferYang · 2024-07-30T04:57:26Z

> Ack. Thank you for sharing. Let me take a look at that as an independent JIRA issue because it's a Mac-only issue, @LuciferYang .

@dongjoon-hyun I apologize for providing misleading information. I just reviewed the recent GA test logs and found that this is not a Mac-only issue:

https://github.com/apache/spark/actions/runs/10155611310/job/28082653276

dongjoon-hyun · 2024-07-30T05:08:37Z

To @LuciferYang , according to the Hadoop code, HADOOP-17982 seems to have only changed the log level in Hadoop 3.4.0.

https://github.com/apache/hadoop/pull/3599/files

dongjoon-hyun · 2024-07-30T05:09:58Z

There has been no functional change in the Hadoop code for the last 10 years except the above log level change. Given that, we can ignore the warning message. WDYT, @LuciferYang ?
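
For context, the warning pattern discussed above can be sketched in plain Java. This is a minimal illustration and not Hadoop's actual code: `NativeProbe`, `isNativeAvailable`, `loadingFailure`, and `chooseCodec` are hypothetical names, and the codec labels are placeholders. It only shows that calling a native method with no linked library throws `UnsatisfiedLinkError`, which a static initializer can catch and log while callers fall back to a pure-Java path.

```java
public class NativeProbe {
    // Hypothetical native method: no library implements it in this sketch,
    // so the JVM cannot link it and the first call throws UnsatisfiedLinkError.
    private static native boolean buildSupportsOpenssl();

    // Null when the native path is usable, otherwise a description of the failure.
    private static final String LOADING_FAILURE;

    static {
        String failure = null;
        try {
            if (!buildSupportsOpenssl()) {
                failure = "build does not support openssl";
            }
        } catch (Throwable t) {
            // UnsatisfiedLinkError is an Error, so catching Exception would miss it.
            failure = t.getClass().getName() + ": " + t.getMessage();
            // The failure is only logged; class initialization still succeeds.
            System.err.println("WARN Failed to load OpenSSL Cipher. " + failure);
        }
        LOADING_FAILURE = failure;
    }

    public static boolean isNativeAvailable() {
        return LOADING_FAILURE == null;
    }

    public static String loadingFailure() {
        return LOADING_FAILURE;
    }

    // Placeholder labels: callers pick the pure-Java path when the native one is missing.
    public static String chooseCodec() {
        return isNativeAvailable() ? "native-openssl" : "pure-java-jce";
    }

    public static void main(String[] args) {
        System.out.println("codec = " + chooseCodec());
    }
}
```

Under these assumptions, running `main` logs one warning and then selects the pure-Java label, which mirrors why the suite can still pass while the stack trace appears in the log.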

LuciferYang · 2024-07-30T05:11:38Z

@dongjoon-hyun Thank you for your explanation, agree with you

github-actions bot added the BUILD label Mar 19, 2024

dongjoon-hyun force-pushed the SPARK-45393 branch from 5c301dc to d960d98 Compare March 19, 2024 15:16

dongjoon-hyun mentioned this pull request Mar 19, 2024

ORC-1608: Upgrade Hadoop to 3.4.0 apache/orc#1783

Closed

github-actions bot added the SPARK SHELL label Mar 19, 2024

dongjoon-hyun force-pushed the SPARK-45393 branch from fbe5d40 to 179848f Compare March 20, 2024 00:53

github-actions bot removed the SPARK SHELL label Mar 20, 2024

dongjoon-hyun and others added 2 commits March 19, 2024 20:25

[SPARK-45393][BUILD] Upgrade Hadoop to 3.4.0

02195c3

Add YangJie

2358f85

dongjoon-hyun force-pushed the SPARK-45393 branch from 179848f to 2358f85 Compare March 20, 2024 03:25

LuciferYang approved these changes Mar 20, 2024

dongjoon-hyun marked this pull request as ready for review March 20, 2024 14:55

github-actions bot added the YARN label Mar 20, 2024

Ignore yarn test

798c5a1

dongjoon-hyun force-pushed the SPARK-45393 branch from ab86d62 to 798c5a1 Compare March 20, 2024 17:23

dongjoon-hyun mentioned this pull request Mar 20, 2024

[SPARK-27950][DSTREAMS][Kinesis] dynamoDBEndpointUrl and cloudWatchMetricsLevel for Kinesis #24801

Closed

dongjoon-hyun closed this in 49b4c3b Mar 20, 2024

dongjoon-hyun mentioned this pull request Mar 20, 2024

[SPARK-47141][CORE] Support enabling migration of shuffle data directly to external storage using config parameter. #45228

Closed

dongjoon-hyun mentioned this pull request Mar 28, 2024

[SPARK-47632][BUILD] Ban com.amazonaws:aws-java-sdk-bundle dependency #45759

Closed

dongjoon-hyun deleted the SPARK-45393 branch July 30, 2024 04:05

razvan mentioned this pull request Aug 16, 2024

chore(spark): version 4.0.0-preview stackabletech/docker-images#808

Closed
