
Conversation

@sunchao
Member

@sunchao sunchao commented Oct 22, 2020

What changes were proposed in this pull request?

This upgrades the default Hadoop version from 3.2.1 to 3.3.1. The changes here simply update the version number and dependency files.

Why are the changes needed?

Hadoop 3.3.1 just came out, and it ships many client-side improvements, such as faster S3A/ABFS access (reportedly around 20% faster against S3). These are important for users who want to run Spark in a cloud environment.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Existing unit tests in Spark
  • Manually tested using my S3 bucket for event log dir:
bin/spark-shell \
  -c spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \
  -c spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY \
  -c spark.eventLog.enabled=true \
  -c spark.eventLog.dir=s3a://<my-bucket>
  • Manually tested against docker-based YARN dev cluster, by running SparkPi.

@SparkQA

SparkQA commented Oct 22, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34781/

@SparkQA

SparkQA commented Oct 22, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34781/

@dongjoon-hyun
Member

The failures seem relevant, since they are in the YARN test suites.

Error:  Error: Total 118, Failed 0, Errors 3, Passed 115, Canceled 1
Error:  Error during tests:
Error:  	org.apache.spark.deploy.yarn.YarnClusterSuite
Error:  	org.apache.spark.deploy.yarn.YarnShuffleAuthSuite
Error:  	org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite

@dongjoon-hyun
Member

BTW, thank you so much, @sunchao !

@sunchao
Member Author

sunchao commented Oct 22, 2020

Yes, the test failures are new. They seem related to #29843, but I'm not sure why I never saw them in any of the previous runs.

@SparkQA

SparkQA commented Oct 22, 2020

Test build #130175 has finished for PR 30135 at commit 1a4cdeb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao
Member Author

sunchao commented Oct 22, 2020

And all the tests in Jenkins are passing. This is weird...

@dongjoon-hyun
Member

Oh..

@dongjoon-hyun
Member

I retriggered GitHub Actions. Let's see.

@dongjoon-hyun dongjoon-hyun changed the title [WIP][SPARK-29250] Upgrade to Hadoop 3.3.0 [WIP][SPARK-29250][BUILD] Upgrade to Hadoop 3.3.0 Oct 22, 2020
@dongjoon-hyun
Member

cc @srowen , @dbtsai

@dongjoon-hyun dongjoon-hyun changed the title [WIP][SPARK-29250][BUILD] Upgrade to Hadoop 3.3.0 [SPARK-29250][BUILD] Upgrade to Hadoop 3.3.0 Oct 22, 2020
@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review October 22, 2020 22:17
@dongjoon-hyun
Member

This seems to hit bouncycastle.

$ build/sbt "yarn/testOnly *.YarnClusterSuite" -Pyarn -Phadoop-3.2
...
[info] YarnClusterSuite:
[info] org.apache.spark.deploy.yarn.YarnClusterSuite *** ABORTED *** (580 milliseconds)
[info]   java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/org/bouncycastle/operator/OperatorCreationException

@dongjoon-hyun dongjoon-hyun marked this pull request as draft October 22, 2020 22:20
@srowen
Member

srowen commented Oct 22, 2020

What does this imply for compatibility with clusters? Does it work with Hadoop 3.2 and earlier? We may need to rename the profile eventually to "hadoop-3" or something; implicitly it's "Hadoop 3.2+".

@sunchao
Member Author

sunchao commented Oct 22, 2020

Yeah, it should work with Hadoop 3.2+ clusters, since Hadoop maintains wire compatibility across minor releases. +1 on renaming the profile to hadoop-3 (and probably renaming hadoop-2.7 to hadoop-2 as well).

@dongjoon-hyun
Member

dongjoon-hyun commented Oct 23, 2020

I retriggered GitHub Actions once more.

@srowen and @sunchao, +1 for the renaming. Shall we do it in another PR, since it will touch many scripts?

  • hadoop-3.2 -> hadoop-3
  • hadoop-2.7 -> hadoop-2.
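
For reference, the profile in question is selected at build time; a hedged sketch of what the rename would mean for users (command shape per Spark's build documentation; exact flags can vary by branch):

```shell
# Current profile name (as of this PR):
./build/mvn -Pyarn -Phadoop-3.2 -DskipTests package

# After the proposed rename, the equivalent would be:
# ./build/mvn -Pyarn -Phadoop-3 -DskipTests package
```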

@github-actions

github-actions bot commented Feb 1, 2021

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 1, 2021
@github-actions github-actions bot closed this Feb 2, 2021
@sunchao
Member Author

sunchao commented Feb 2, 2021

This went stale because Spark can't use Hadoop 3.3.0 as-is (see HADOOP-16080). The work may resume once Hadoop 3.3.1 is released.

@dongjoon-hyun
Member

Thank you for keeping working on this area.

@HyukjinKwon
Member

👍

@sunchao
Member Author

sunchao commented May 24, 2021

Since Hadoop 3.3.1 RC1 is out, I'm going to revive this PR and test it here.

@dbtsai dbtsai reopened this May 24, 2021
@sunchao
Member Author

sunchao commented Jun 16, 2021

Thanks @dongjoon-hyun for merging!

@sunchao sunchao deleted the SPARK-29250 branch June 16, 2021 23:01
@dongjoon-hyun
Member

dongjoon-hyun commented Jun 19, 2021

Hi, All. FYI,

Starting with Apache Hadoop 3.3.1, HADOOP-16878 was reverted, as the last revert commit on branch-3.3.1.

However, Apache Spark's Jenkins seems to hit a flaky failure.

org.apache.hadoop.fs.PathOperationException: 
`Source (file:/home/jenkins/workspace/spark-master-test-maven-hadoop-3.2/resource-managers/yarn/target/tmp/spark-703b8e99-63cc-4ba6-a9bc-25c7cae8f5f9/testJar9120517778809167117.jar) and destination (/home/jenkins/workspace/spark-master-test-maven-hadoop-3.2/resource-managers/yarn/target/tmp/spark-703b8e99-63cc-4ba6-a9bc-25c7cae8f5f9/testJar9120517778809167117.jar)
are equal in the copy command.': Operation not supported
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:403)

Please note that this is a flaky failure, and at the same time the error log points to FileUtil.java:403, which is the code before the revert. I'm investigating this.

cc @gengliangwang since he is the release manager for Apache Spark 3.2.0.

@dongjoon-hyun
Member

Since Apache Hadoop trunk has this behavior, I made a PR to be more robust against the underlying behavior difference.

@gengliangwang
Member

@dongjoon-hyun Thanks for taking care of the error!

@arghya18

What's the latest on this? Are there any issues using Hadoop 3.3.1 with Spark?

@dongjoon-hyun
Member

We are preparing Apache Spark 3.2.0 with Hadoop 3.3.1, @arghya18 .

@arghya18

@dongjoon-hyun Thanks for the info. Actually we are hitting a major S3A issue, HADOOP-17755, with ORC. It appears to have been fixed for Parquet in Hadoop 3.3.0, but it seems to be a common issue (HADOOP-16109). Currently we work around it by setting fs.s3a.readahead.range to 1G, which is not a good idea.
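
The workaround just mentioned, expressed as a Spark conf (illustrative sketch only; the 1G value makes S3A read far ahead so streams are not repeatedly reopened, at the cost of very large GET requests):

```shell
# Hypothetical invocation showing the workaround; not a recommended setting.
bin/spark-shell \
  --conf spark.hadoop.fs.s3a.readahead.range=1G
```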

I am using Spark on k8s with the magic committer, so I wanted to check: will the issues mentioned above impact me if I manually build Spark 3.1.1 with Hadoop 3.3.0/3.3.1?

@dongjoon-hyun
Member

Thank you for sharing, @arghya18 .
HADOOP-17755 sounds like a read-side issue, and the magic committer is a write-side feature. I don't think they are related. If you hit a magic committer issue, please file a JIRA with the Apache Hadoop community.

@arghya18

@dongjoon-hyun Thanks for your response. Yes, I understand the magic committer is not related. I just wanted to understand whether, if I build Spark 3.1.1 with Hadoop 3.3.0/3.3.1, the issues still to be merged (SPARK-35831, SPARK-35868, SPARK-35878) will impact my use case. I am using s3a to read and write ORC files. Or do I have to wait for Spark 3.2.0?

@steveloughran
Contributor

@arghya18

  1. we haven't seen HADOOP-17755 in any of our testing, unless it is HADOOP-16109
  2. still waiting on that JIRA for you to provide config details. Like I said: no info, we close the JIRA.

wanted to check if the above issues mentioned will impact me if I manually build Spark 3.1.1 with Hadoop 3.3.0/3.3.1?

Well, what can anyone say? Without info from you, all we can say is WORKSFORME.

The best thing you can do to help everyone is to check out and build with Hadoop 3.3.1 (which does contain a lot of changes related to the committers, including from @dongjoon-hyun), and see if the problem is still there.

@arghya18

The best thing you can do to help everyone is to check out and build with Hadoop 3.3.1 (which does contain a lot of changes related to the committers, including from @dongjoon-hyun), and see if the problem is still there.

@steveloughran Yes that is the plan. I will do and post the details of the result this weekend.

@steveloughran
Contributor

The parquet EOF fix is also in hadoop-3.2.2, so you could try that. However, testing with 3.3.1 is better because

  1. we can do workarounds in spark before the release
  2. we can fix hadoop branch-3.3;
  3. any bug you file against hadoop 3.2.x will have "try with 3.3.1" as the initial response

@arghya18

@steveloughran Thanks for the suggestion.
Can anyone help me with steps (or a Dockerfile) to change the Hadoop version in a prebuilt Spark Docker image, if you have one handy, so I can progress quickly? Otherwise I will google it; I'm a newbie with Docker :)

Thanks a lot.

@arghya18

arghya18 commented Jul 2, 2021

@dongjoon-hyun @steveloughran I was able to test my use case with Hadoop 3.3.1 and posted the result on HADOOP-17755.
To my surprise, the read is slower (with the same resources and the same config) on Hadoop 3.3.1 than on Hadoop 3.2.0 without the mentioned issue. It is possible I am missing something.

@dongjoon-hyun
Member

Thank you for sharing, @arghya18. It's interesting. I also observed the increased read statistics in my environment, but TPC-DS 1TB on S3 Parquet performance was faster for me. I'll keep tracking HADOOP-17755 together.

@arghya18

arghya18 commented Jul 2, 2021

@dongjoon-hyun Thanks. I am testing more jobs for further statistics. BTW, I am testing this on ORC.

@dongjoon-hyun
Member

dongjoon-hyun commented Jul 2, 2021

Oh, if you are using ORC, please try to bring in SPARK-35783. It's irrelevant to this Hadoop topic, but it helps you reduce the traffic.

@arghya18

arghya18 commented Jul 2, 2021

@dongjoon-hyun OK, I will do that, but I thought the increase in reads and time was an effect of the S3A implementation in Hadoop 3.3.1, since that is the only thing that changed when I upgraded from Hadoop 3.2.0 to Hadoop 3.3.1.

@steveloughran
Contributor

To my surprise, the read is slower (with the same resources and the same config) on Hadoop 3.3.1 than on Hadoop 3.2.0 without the mentioned issue. It is possible I am missing something.

Shouldn't happen. Really shouldn't happen. We do not see that in our TPC-DS benchmarks.

The main way I could see this happening is if the seek policy hasn't switched to random on the first backwards seek. Explicitly set it.

spark.hadoop.fs.s3a.experimental.input.fadvise random

Hadoop 3.3.1 has a stats collection API (IOStatistics) for filesystems, streams, etc.

  • call toString() on a stream to get its stats, including the number of bytes discarded in seeks and streams aborted
  • do the same for the FS to get the aggregate stats.

High counts of discarded bytes and aborted streams are signs of a bad seek policy.

Set these two loggers to debug and see what they say.

org.apache.hadoop.fs.s3a.S3AInputStream
org.apache.hadoop.fs.s3a.S3AStorageStatistics
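
Putting the advice above together as a sketch (the config key and logger names are as given above; log4j 1.x syntax is assumed, as bundled with Spark 3.x; treat this as illustrative, not a verified recipe):

```shell
# Pin the S3A seek policy to random I/O:
bin/spark-shell \
  --conf spark.hadoop.fs.s3a.experimental.input.fadvise=random

# And in conf/log4j.properties, enable the two debug loggers:
#   log4j.logger.org.apache.hadoop.fs.s3a.S3AInputStream=DEBUG
#   log4j.logger.org.apache.hadoop.fs.s3a.S3AStorageStatistics=DEBUG
```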

@gengliangwang
Member

@sunchao @dongjoon-hyun @steveloughran Are there any other known issues with Hadoop 3.3.1? Should we include this upgrade in Spark 3.2?

@dongjoon-hyun
Member

+1 for backporting!

@sunchao
Member Author

sunchao commented Aug 6, 2021

I'm not aware of any issues at the moment; we've already been using this (though a slightly different internal Hadoop version) for a while now. @steveloughran can offer more input from the Hadoop/S3 side.

Besides that, I wanted to add #33160 so that users can build Spark with older versions of Hadoop that do not support the shaded client. If people feel this is useful, I can try to push it over the finish line. I'd also like to add some user documentation on compiling Spark against different versions of Hadoop (besides the default one).
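
For anyone wanting to try this before such documentation lands, Spark's build already lets you override the Hadoop version (sketch only; check Spark's build documentation for supported combinations):

```shell
# Build against a specific Hadoop release; the version shown is illustrative.
./build/mvn -Pyarn -Phadoop-3.2 -Dhadoop.version=3.3.1 -DskipTests package
```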

@steveloughran
Contributor

Known 3.3.1 regressions? Not AFAIK.

@gengliangwang
Member

@steveloughran Thanks for the info!
@sunchao then let's finish #33160 and have it in 3.2. Thanks!

ulysses-you pushed a commit to apache/kyuubi that referenced this pull request Oct 19, 2021

### _Why are the changes needed?_
Spark 3.2.0 is out, which bundles the Hadoop 3.3.1 shaded client by default. apache/spark#30135

The test failed when the Hadoop 3.3.1 client connects to a YARN 3.2.2 mini cluster:

```
Cause: java.lang.RuntimeException: org.apache.kyuubi.KyuubiSQLException:java.lang.ClassCastException: org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterMetricsRequestProto cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:123)
	at com.sun.proxy.$Proxy12.getClusterMetrics(Unknown Source)
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:271)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy13.getClusterMetrics(Unknown Source)
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:638)
	at org.apache.spark.deploy.yarn.Client.$anonfun$submitApplication$1(Client.scala:179)
	at org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
	at org.apache.spark.internal.Logging.logInfo$(Logging.scala:56)
	at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:65)
	at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:179)
	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
	at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:581)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
	at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
	at org.apache.kyuubi.engine.spark.SparkSQLEngine$.createSpark(SparkSQLEngine.scala:103)
	at org.apache.kyuubi.engine.spark.SparkSQLEngine$.main(SparkSQLEngine.scala:155)
	at org.apache.kyuubi.engine.spark.SparkSQLEngine.main(SparkSQLEngine.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
	at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:165)
	at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:163)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:163)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [x] [Run test](https://kyuubi.readthedocs.io/en/latest/tools/testing.html#running-tests) locally before make a pull request

Closes #757 from pan3793/hadoop-3.3.

Closes #757

7ec9313 [Cheng Pan] [DEPS] Bump Hadoop 3.3.1

Authored-by: Cheng Pan <379377944@qq.com>
Signed-off-by: ulysses-you <ulyssesyou@apache.org>