[SPARK-29250][BUILD] Upgrade to Hadoop 3.3.1 #30135
Conversation
Kubernetes integration test starting
Kubernetes integration test status success
The failures seem to be relevant since they are in the YARN test suite.
BTW, thank you so much, @sunchao !
Yes, the test failures are new. It seems to be related to #29843, but I'm not sure why I've never seen it before in all the previous runs.
Test build #130175 has finished for PR 30135 at commit
And all the tests in Jenkins are passing. This is weird ...
Oh..
I retriggered GitHub Actions. Let's see.
This seems to hit
What does it imply for compatibility with clusters? It works with Hadoop 3.2 and earlier? We may need to rename the profile eventually to "hadoop-3" or something; implicitly it's "Hadoop 3.2+".
Yeah, it should work with Hadoop 3.2+ clusters, since Hadoop maintains wire compatibility across minor releases. +1 on renaming this to hadoop-3 (and probably rename
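For reference, this is roughly how the profile comes into play at build time. A sketch only: the profile name `hadoop-3.2` is the one that existed when this was written (substitute `hadoop-3` if the rename lands), and the extra flags are illustrative.

```shell
# Build Spark against the Hadoop 3.x line using the existing profile name.
# Wire compatibility across Hadoop 3.x minor releases means a client built
# this way is expected to talk to Hadoop 3.2+ clusters.
./build/mvn -Phadoop-3.2 -DskipTests clean package
```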
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
This was marked stale because Spark can't use Hadoop 3.3.0 as it is (see HADOOP-16080). The work may be resumed once Hadoop 3.3.1 is released.
Thank you for continuing to work in this area.
👍
Since Hadoop 3.3.1 RC1 is out, I'm going to revive this PR and test it here.
Thanks @dongjoon-hyun for merging!
Hi, all. FYI, as of Apache Hadoop 3.3.1, we reverted HADOOP-16878 as the last revert commit. However, Apache Spark Jenkins seems to hit a flakiness issue. Please note that this is flakiness, and at the same time the error log is pointing there. cc @gengliangwang since he is the release manager for Apache Spark 3.2.0.
Since Apache Hadoop trunk has this behavior, I made a PR to be more robust against the underlying behavior difference.
@dongjoon-hyun Thanks for taking care of the error!
What's the latest on this? Any issues using Hadoop 3.3.1 with Spark?
We are preparing Apache Spark 3.2.0 with Hadoop 3.3.1, @arghya18 .
@dongjoon-hyun Thanks for the info. Actually, we are having a major issue related to S3A (HADOOP-17755) with ORC, which seems to be solved in Hadoop 3.3.0 for Parquet, but it seems to be a common issue (HADOOP-16109). Currently we are working around it using fs.s3a.readahead.range = 1G, which is not a good idea. I am using Spark on K8s and the magic committer, so I wanted to check whether the issues mentioned above will impact me if I manually build Spark 3.1.1 with Hadoop 3.3.0/3.3.1.
Thank you for sharing, @arghya18 .
@dongjoon-hyun Thanks for your response. Yes, I understand the magic committer is not related; I just wanted to understand whether, if I build Spark 3.1.1 with Hadoop 3.3.0/3.3.1, the existing issues to be merged (SPARK-35831, SPARK-35868, SPARK-35878) will impact my use case. I am using S3A to read and write ORC files. Or do I have to wait for Spark 3.2.0?
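For anyone attempting the build experiment discussed above, the usual way to pin Spark to a specific Hadoop release is to override `hadoop.version` when producing a distribution. A sketch under assumptions: the distribution name and profile list are illustrative and would need to match your deployment (K8s here, per the thread).

```shell
# Build a Spark 3.1.1 distribution against Hadoop 3.3.1 for a K8s deployment.
# Run from a Spark 3.1.1 source checkout; the hadoop-3.2 profile covers the
# Hadoop 3.x line, and -Dhadoop.version overrides its default version.
./dev/make-distribution.sh --name custom-hadoop-3.3.1 --tgz \
  -Pkubernetes -Phadoop-3.2 -Dhadoop.version=3.3.1 -DskipTests
```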
Well, what can anyone say? Without info from you, all we can say is WORKSFORME. The best thing you can do to help everyone is check out and build with Hadoop 3.3.1 (which does contain a lot of changes related to the committers, including from @dongjoon-hyun), and see if the problem is still there.
@steveloughran Yes, that is the plan. I will do it and post the details of the results this weekend.
The Parquet EOF fix is also in Hadoop 3.2.2, so you could try that. However, testing with 3.3.1 is better because
@steveloughran Thanks a lot for the suggestion.
@dongjoon-hyun @steveloughran I was able to test my use case with Hadoop 3.3.1 and posted the results in HADOOP-17755.
Thank you for sharing, @arghya18 . It's interesting. The read-statistic increase is also observed in my environment, but TPC-DS 1TB on S3 Parquet performance was faster for me. I'll keep tracking HADOOP-17755 together.
@dongjoon-hyun Thanks. I am testing more jobs for further statistics. BTW, I am testing this on ORC.
Oh, if you are using ORC, please try to bring in SPARK-35783. It's unrelated to this Hadoop topic, but it helps you reduce the traffic.
@dongjoon-hyun OK, I will do that, but I thought the read and time increase was an effect of the S3A implementation in Hadoop 3.3.1, as that changed only after I upgraded from Hadoop 3.2.0 to Hadoop 3.3.1.
Shouldn't happen; really shouldn't happen. We do not see that on our TPC-DS benchmarks. The main way I could see this happening is if the seek policy hasn't switched to random on the first backwards seek. Explicitly set it. Hadoop 3.3.1 has a stats collection API (IOStatistics) for filesystems, streams, etc.
High counts of bytes discarded and aborts are signs of a bad seek policy. Set these two logs to debug and see what they say.
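The seek-policy advice above can be applied through standard S3A configuration, and stream behavior can be inspected by turning up logging. A sketch only: the property names are the standard S3A ones, but the logger names are my assumption about which "two logs" were meant, and the readahead value is illustrative.

```properties
# spark-defaults.conf: ask S3A for random-access reads up front, since
# columnar formats like ORC/Parquet seek backwards heavily.
spark.hadoop.fs.s3a.experimental.input.fadvise=random
# Keep readahead modest; a 1G readahead forces huge discards on every seek.
spark.hadoop.fs.s3a.readahead.range=256K

# log4j.properties: debug logging to see seeks, aborts, and discarded bytes
# (logger names assumed from the S3A client classes).
log4j.logger.org.apache.hadoop.fs.s3a.S3AInputStream=DEBUG
log4j.logger.org.apache.hadoop.fs.s3a.S3AFileSystem=DEBUG
```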
@sunchao @dongjoon-hyun @steveloughran Are there any other known issues with Hadoop 3.3.1? Should we have this upgrade in Spark 3.2?
+1 for backporting!
I'm not aware of any issues at the moment; we've already been using this (although a slightly different internal Hadoop version) for a while now. @steveloughran can offer more input from the Hadoop/S3 side. Besides that, I wanted to add #33160 so that users can build Spark with older versions of Hadoop which do not support the shaded client. If people feel this is useful, I can try to push it to the finish line. I also wanted to add some user documentation on compiling Spark against different versions of Hadoop (besides the default one).
Known 3.3.1 regressions? Not AFAIK.
@steveloughran Thanks for the info!
What changes were proposed in this pull request?
This upgrades the default Hadoop version from 3.2.1 to 3.3.1. The changes here simply update the version number and the dependency manifest files.
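The version bump itself is essentially a one-property change in the parent POM (the property name below is the one Spark's build uses; shown here as a sketch, alongside the regeneration step the dependency manifests typically require):

```xml
<!-- pom.xml (parent): default Hadoop version picked up by the Hadoop 3.x profile -->
<hadoop.version>3.3.1</hadoop.version>
```

After changing the property, the `dev/deps` dependency manifests are regenerated so they list the new Hadoop artifact versions.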
Why are the changes needed?
Hadoop 3.3.1 just came out, and it comes with many client-side improvements, such as those for S3A/ABFS (20% faster when accessing S3). These are important for users who want to use Spark in a cloud environment.
Does this PR introduce any user-facing change?
No
How was this patch tested?
SparkPi.