
Conversation

@sunchao
Member

@sunchao sunchao commented Oct 22, 2020

What changes were proposed in this pull request?

This upgrades the default Hadoop version from 3.2.1 to 3.3.1. The changes here simply update the version number and dependency files.

Why are the changes needed?

Hadoop 3.3.1 just came out, and it ships many client-side improvements, such as faster S3A/ABFS access (reportedly around 20% faster against S3). These are important for users who want to run Spark in a cloud environment.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Existing unit tests in Spark
  • Manually tested using my S3 bucket for event log dir:
bin/spark-shell \
  -c spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \
  -c spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY \
  -c spark.eventLog.enabled=true \
  -c spark.eventLog.dir=s3a://<my-bucket>
  • Manually tested against docker-based YARN dev cluster, by running SparkPi.

@SparkQA

SparkQA commented Oct 22, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34781/

@SparkQA

SparkQA commented Oct 22, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34781/

@dongjoon-hyun
Member

The failures seem relevant, since they are in the YARN test suites.

Error:  Error: Total 118, Failed 0, Errors 3, Passed 115, Canceled 1
Error:  Error during tests:
Error:  	org.apache.spark.deploy.yarn.YarnClusterSuite
Error:  	org.apache.spark.deploy.yarn.YarnShuffleAuthSuite
Error:  	org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite

@dongjoon-hyun
Member

BTW, thank you so much, @sunchao !

@sunchao
Member Author

sunchao commented Oct 22, 2020

Yes, the test failures are new. They seem related to #29843, but I'm not sure why I never saw them in any of the previous runs.

@SparkQA

SparkQA commented Oct 22, 2020

Test build #130175 has finished for PR 30135 at commit 1a4cdeb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao
Member Author

sunchao commented Oct 22, 2020

And all the tests in Jenkins are passing. This is weird...

@dongjoon-hyun
Member

Oh..

@dongjoon-hyun
Member

I retriggered GitHub Actions. Let's see.

@dongjoon-hyun dongjoon-hyun changed the title [WIP][SPARK-29250] Upgrade to Hadoop 3.3.0 [WIP][SPARK-29250][BUILD] Upgrade to Hadoop 3.3.0 Oct 22, 2020
@dongjoon-hyun
Member

cc @srowen , @dbtsai

@dongjoon-hyun dongjoon-hyun changed the title [WIP][SPARK-29250][BUILD] Upgrade to Hadoop 3.3.0 [SPARK-29250][BUILD] Upgrade to Hadoop 3.3.0 Oct 22, 2020
@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review October 22, 2020 22:17
@dongjoon-hyun
Member

This seems to hit bouncycastle.

$ build/sbt "yarn/testOnly *.YarnClusterSuite" -Pyarn -Phadoop-3.2
...
[info] YarnClusterSuite:
[info] org.apache.spark.deploy.yarn.YarnClusterSuite *** ABORTED *** (580 milliseconds)
[info]   java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/org/bouncycastle/operator/OperatorCreationException

@dongjoon-hyun dongjoon-hyun marked this pull request as draft October 22, 2020 22:20
@srowen
Member

srowen commented Oct 22, 2020

What does this imply for compatibility with clusters? Does it work with Hadoop 3.2 and earlier? We may need to rename the profile eventually to "hadoop-3" or something; implicitly it's "Hadoop 3.2+".

@sunchao
Member Author

sunchao commented Oct 22, 2020

Yeah, it should work with Hadoop 3.2+ clusters, since Hadoop maintains wire compatibility across minor releases. +1 on renaming the profile to hadoop-3 (and probably renaming hadoop-2.7 to hadoop-2 as well).

@dongjoon-hyun
Member

dongjoon-hyun commented Oct 23, 2020

I retriggered GitHub Actions once more.

@srowen and @sunchao, +1 for the renaming. Shall we do it in another PR, since it will touch many scripts?

  • hadoop-3.2 -> hadoop-3
  • hadoop-2.7 -> hadoop-2.
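
For reference, the profile in question is selected at build time; a hedged sketch of what the rename would mean for users (command shape per Spark's build documentation; exact flags can vary by branch):

```shell
# Current profile name (as of this PR):
./build/mvn -Pyarn -Phadoop-3.2 -DskipTests package

# After the proposed rename, the equivalent would be:
# ./build/mvn -Pyarn -Phadoop-3 -DskipTests package
```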

@github-actions

github-actions bot commented Feb 1, 2021

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 1, 2021
@github-actions github-actions bot closed this Feb 2, 2021
@sunchao
Member Author

sunchao commented Feb 2, 2021

This went stale because Spark can't use Hadoop 3.3.0 as-is (see HADOOP-16080). The work may resume once Hadoop 3.3.1 is released.

@dongjoon-hyun
Member

Thank you for keeping working on this area.

@HyukjinKwon
Member

👍

@sunchao
Member Author

sunchao commented May 24, 2021

Since Hadoop 3.3.1 RC1 is out, I'm going to revive this PR and test it here.

@dbtsai dbtsai reopened this May 24, 2021
@sunchao
Member Author

sunchao commented Jun 16, 2021

Thanks @dongjoon-hyun for merging!

@sunchao sunchao deleted the SPARK-29250 branch June 16, 2021 23:01
@dongjoon-hyun
Member

dongjoon-hyun commented Jun 19, 2021

Hi, All. FYI,

Starting with Apache Hadoop 3.3.1, HADOOP-16878 was reverted, as the last revert commit on branch-3.3.1.

However, Apache Spark's Jenkins seems to hit a flaky failure.

org.apache.hadoop.fs.PathOperationException: 
`Source (file:/home/jenkins/workspace/spark-master-test-maven-hadoop-3.2/resource-managers/yarn/target/tmp/spark-703b8e99-63cc-4ba6-a9bc-25c7cae8f5f9/testJar9120517778809167117.jar) and destination (/home/jenkins/workspace/spark-master-test-maven-hadoop-3.2/resource-managers/yarn/target/tmp/spark-703b8e99-63cc-4ba6-a9bc-25c7cae8f5f9/testJar9120517778809167117.jar)
are equal in the copy command.': Operation not supported
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:403)

Please note that this is a flaky failure, and at the same time the error log points to FileUtil.java:403, which is the code before the revert. I'm investigating this.

cc @gengliangwang since he is the release manager for Apache Spark 3.2.0.

@dongjoon-hyun
Member

Since Apache Hadoop trunk has this behavior, I made a PR to be more robust against the underlying behavior difference.

@gengliangwang
Member

@dongjoon-hyun Thanks for taking care of the error!

@arghya18

What's the latest on this? Are there any issues using Hadoop 3.3.1 with Spark?

@dongjoon-hyun
Member

We are preparing Apache Spark 3.2.0 with Hadoop 3.3.1, @arghya18 .

@arghya18

@dongjoon-hyun Thanks for the info. Actually we are hitting a major S3A issue, HADOOP-17755, with ORC. It appears to have been fixed for Parquet in Hadoop 3.3.0, but it seems to be a common issue (HADOOP-16109). Currently we work around it by setting fs.s3a.readahead.range to 1G, which is not a good idea.
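
The workaround just mentioned, expressed as a Spark conf (illustrative sketch only; the 1G value makes S3A read far ahead so streams are not repeatedly reopened, at the cost of very large GET requests):

```shell
# Hypothetical invocation showing the workaround; not a recommended setting.
bin/spark-shell \
  --conf spark.hadoop.fs.s3a.readahead.range=1G
```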

I am using Spark on k8s with the magic committer, so I wanted to check: will the issues mentioned above impact me if I manually build Spark 3.1.1 with Hadoop 3.3.0/3.3.1?

@dongjoon-hyun
Member

Thank you for sharing, @arghya18 .
HADOOP-17755 sounds like a read-side issue, and the magic committer is a write-side feature. I don't think they are related. If you hit a magic committer issue, please file a JIRA with the Apache Hadoop community.

@arghya18

@dongjoon-hyun Thanks for your response. Yes, I understand the magic committer is not related. I just wanted to understand whether, if I build Spark 3.1.1 with Hadoop 3.3.0/3.3.1, the issues still to be merged (SPARK-35831, SPARK-35868, SPARK-35878) will impact my use case. I am using s3a to read and write ORC files. Or do I have to wait for Spark 3.2.0?

@steveloughran
Contributor

@arghya18

  1. we haven't seen HADOOP-17755 in any of our testing, unless it is HADOOP-16109
  2. still waiting on that JIRA for you to provide config details. Like I said: no info, we close the JIRA.

wanted to check if the above issues mentioned will impact me if I manually build Spark 3.1.1 with Hadoop 3.3.0/3.3.1?

Well, what can anyone say? Without info from you, all we can say is WORKSFORME.

The best thing you can do to help everyone is to check out and build with Hadoop 3.3.1 (which does contain a lot of changes related to the committers, including from @dongjoon-hyun), and see if the problem is still there.

@arghya18

The best thing you can do to help everyone is to check out and build with Hadoop 3.3.1 (which does contain a lot of changes related to the committers, including from @dongjoon-hyun), and see if the problem is still there.

@steveloughran Yes that is the plan. I will do and post the details of the result this weekend.

@steveloughran
Contributor

The parquet EOF fix is also in hadoop-3.2.2, so you could try that. However, testing with 3.3.1 is better because

  1. we can do workarounds in spark before the release
  2. we can fix hadoop branch-3.3;
  3. any bug you file against hadoop 3.2.x will have "try with 3.3.1" as the initial response

@arghya18

@steveloughran Thanks for the suggestion.
Can anyone help me with steps (or a Dockerfile) to change the Hadoop version in a prebuilt Spark Docker image, if you have one handy, so I can progress quickly? Otherwise I will google it; I'm a newbie with Docker :)

Thanks a lot.

@arghya18

arghya18 commented Jul 2, 2021

@dongjoon-hyun @steveloughran I was able to test my use case with Hadoop 3.3.1 and posted the result on HADOOP-17755.
To my surprise, the read is slower (with the same resources and the same config) on Hadoop 3.3.1 than on Hadoop 3.2.0 without the mentioned issue. It is possible I am missing something.

@dongjoon-hyun
Member

Thank you for sharing, @arghya18. It's interesting. I also observed the increased read statistics in my environment, but TPC-DS 1TB on S3 Parquet performance was faster for me. I'll keep tracking HADOOP-17755 together.

@arghya18

arghya18 commented Jul 2, 2021

@dongjoon-hyun Thanks. I am testing more jobs for further statistics. BTW, I am testing this on ORC.

@dongjoon-hyun
Member

dongjoon-hyun commented Jul 2, 2021

Oh, if you are using ORC, please try to bring in SPARK-35783. It's irrelevant to this Hadoop topic, but it helps you reduce the traffic.

@arghya18

arghya18 commented Jul 2, 2021

@dongjoon-hyun OK, I will do that, but I thought the increase in reads and time was an effect of the S3A implementation in Hadoop 3.3.1, since that is the only thing that changed when I upgraded from Hadoop 3.2.0 to Hadoop 3.3.1.

@steveloughran
Contributor

To my surprise, the read is slower (with the same resources and the same config) on Hadoop 3.3.1 than on Hadoop 3.2.0 without the mentioned issue. It is possible I am missing something.

Shouldn't happen. Really shouldn't happen. We do not see that in our TPC-DS benchmarks.

The main way I could see this happening is if the seek policy hasn't switched to random on the first backwards seek. Explicitly set it.

spark.hadoop.fs.s3a.experimental.input.fadvise random

Hadoop 3.3.1 has a stats collection API (IOStatistics) for filesystems, streams, etc.

  • call toString() on a stream to get its stats, including the number of bytes discarded in seeks and streams aborted
  • do the same for the FS to get the aggregate stats.

High counts of discarded bytes and aborted streams are signs of a bad seek policy.

Set these two loggers to debug and see what they say.

org.apache.hadoop.fs.s3a.S3AInputStream
org.apache.hadoop.fs.s3a.S3AStorageStatistics
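
Putting the advice above together as a sketch (the config key and logger names are as given above; log4j 1.x syntax is assumed, as bundled with Spark 3.x; treat this as illustrative, not a verified recipe):

```shell
# Pin the S3A seek policy to random I/O:
bin/spark-shell \
  --conf spark.hadoop.fs.s3a.experimental.input.fadvise=random

# And in conf/log4j.properties, enable the two debug loggers:
#   log4j.logger.org.apache.hadoop.fs.s3a.S3AInputStream=DEBUG
#   log4j.logger.org.apache.hadoop.fs.s3a.S3AStorageStatistics=DEBUG
```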

@gengliangwang
Member

@sunchao @dongjoon-hyun @steveloughran Are there any other known issues with Hadoop 3.3.1? Should we include this upgrade in Spark 3.2?

@dongjoon-hyun
Member

+1 for backporting!

@sunchao
Member Author

sunchao commented Aug 6, 2021

I'm not aware of any issues at the moment; we've already been using this (though a slightly different internal Hadoop version) for a while now. @steveloughran can offer more input from the Hadoop/S3 side.

Besides that, I wanted to add #33160 so that users can build Spark with older versions of Hadoop that do not support the shaded client. If people feel this is useful, I can try to push it over the finish line. I'd also like to add some user documentation on compiling Spark against different versions of Hadoop (besides the default one).
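
For anyone wanting to try this before such documentation lands, Spark's build already lets you override the Hadoop version (sketch only; check Spark's build documentation for supported combinations):

```shell
# Build against a specific Hadoop release; the version shown is illustrative.
./build/mvn -Pyarn -Phadoop-3.2 -Dhadoop.version=3.3.1 -DskipTests package
```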

@steveloughran
Contributor

Known 3.3.1 regressions? Not AFAIK.

@gengliangwang
Member

@steveloughran Thanks for the info!
@sunchao then let's finish #33160 and have it in 3.2. Thanks!

ulysses-you pushed a commit to apache/kyuubi that referenced this pull request Oct 19, 2021

### _Why are the changes needed?_
Spark 3.2.0 is out, which bundles the Hadoop 3.3.1 shaded client by default. apache/spark#30135

The test failed when the Hadoop 3.3.1 client connects to a YARN 3.2.2 mini cluster:

```
Cause: java.lang.RuntimeException: org.apache.kyuubi.KyuubiSQLException:java.lang.ClassCastException: org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterMetricsRequestProto cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:123)
	at com.sun.proxy.$Proxy12.getClusterMetrics(Unknown Source)
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:271)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy13.getClusterMetrics(Unknown Source)
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:638)
	at org.apache.spark.deploy.yarn.Client.$anonfun$submitApplication$1(Client.scala:179)
	at org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
	at org.apache.spark.internal.Logging.logInfo$(Logging.scala:56)
	at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:65)
	at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:179)
	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
	at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:581)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
	at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
	at org.apache.kyuubi.engine.spark.SparkSQLEngine$.createSpark(SparkSQLEngine.scala:103)
	at org.apache.kyuubi.engine.spark.SparkSQLEngine$.main(SparkSQLEngine.scala:155)
	at org.apache.kyuubi.engine.spark.SparkSQLEngine.main(SparkSQLEngine.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
	at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:165)
	at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:163)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:163)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [x] [Run test](https://kyuubi.readthedocs.io/en/latest/tools/testing.html#running-tests) locally before make a pull request

Closes #757 from pan3793/hadoop-3.3.

Closes #757

7ec9313 [Cheng Pan] [DEPS] Bump Hadoop 3.3.1

Authored-by: Cheng Pan <379377944@qq.com>
Signed-off-by: ulysses-you <ulyssesyou@apache.org>