
[SPARK-32436][CORE] Initialize numNonEmptyBlocks in HighlyCompressedMapStatus.readExternal#29231

Closed
dongjoon-hyun wants to merge 3 commits into apache:master from dongjoon-hyun:SPARK-32436

Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jul 25, 2020

What changes were proposed in this pull request?

This PR aims to initialize numNonEmptyBlocks in HighlyCompressedMapStatus.readExternal.

In Scala 2.12, this is initialized to -1 via the following.

protected def this() = this(null, -1, null, -1, null, -1)  // For deserialization only
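
The contract being relied on here can be sketched in plain Java (hypothetical `Status` class, not Spark's actual code): `Externalizable` deserialization only guarantees that the public no-arg constructor runs before `readExternal`, so any field that `writeExternal` deliberately skips must be assigned explicitly, either in that constructor or in `readExternal` itself, which is what this PR does.

```java
import java.io.*;

// Hypothetical sketch of the Externalizable pattern used by
// HighlyCompressedMapStatus: numNonEmptyBlocks is not serialized,
// so readExternal must (re)initialize it explicitly.
public class Main {

    public static class Status implements Externalizable {
        private int numNonEmptyBlocks;
        private long totalSize;

        // Required by Externalizable: a public no-arg constructor.
        public Status() { this.numNonEmptyBlocks = -1; }

        public Status(int numNonEmptyBlocks, long totalSize) {
            this.numNonEmptyBlocks = numNonEmptyBlocks;
            this.totalSize = totalSize;
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeLong(totalSize); // numNonEmptyBlocks is deliberately not written
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            numNonEmptyBlocks = -1;   // explicit init, mirroring the fix in this PR
            totalSize = in.readLong();
        }

        public int numNonEmptyBlocks() { return numNonEmptyBlocks; }
        public long totalSize() { return totalSize; }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Status(10, 1234L));
        }
        Status back;
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            back = (Status) in.readObject();
        }
        // The round-tripped object has the sentinel value, not the original 10.
        if (back.numNonEmptyBlocks() != -1 || back.totalSize() != 1234L) {
            throw new AssertionError("unexpected deserialized state");
        }
        System.out.println("ok");
    }
}
```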

Why are the changes needed?

In Scala 2.13, this causes several UT failures because HighlyCompressedMapStatus.readExternal doesn't initialize this field. The following is one example.

  • org.apache.spark.scheduler.MapStatusSuite
MapStatusSuite:
- compressSize
- decompressSize
*** RUN ABORTED ***
  java.lang.NoSuchFieldError: numNonEmptyBlocks
  at org.apache.spark.scheduler.HighlyCompressedMapStatus.<init>(MapStatus.scala:181)
  at org.apache.spark.scheduler.HighlyCompressedMapStatus$.apply(MapStatus.scala:281)
  at org.apache.spark.scheduler.MapStatus$.apply(MapStatus.scala:73)
  at org.apache.spark.scheduler.MapStatusSuite.$anonfun$new$8(MapStatusSuite.scala:64)
  at scala.runtime.java8.JFunction1$mcVD$sp.apply(JFunction1$mcVD$sp.scala:18)
  at scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.scheduler.MapStatusSuite.$anonfun$new$7(MapStatusSuite.scala:61)
  at scala.runtime.java8.JFunction1$mcVJ$sp.apply(JFunction1$mcVJ$sp.scala:18)
  at scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.scheduler.MapStatusSuite.$anonfun$new$6(MapStatusSuite.scala:60)
  ...

Does this PR introduce any user-facing change?

No. This is a private class.

How was this patch tested?

  1. Pass the GitHub Action or Jenkins with the existing tests.
  2. Test with Scala-2.13 with MapStatusSuite.
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.scheduler.MapStatusSuite
...
MapStatusSuite:
- compressSize
- decompressSize
- MapStatus should never report non-empty blocks' sizes as 0
- large tasks should use org.apache.spark.scheduler.HighlyCompressedMapStatus
- HighlyCompressedMapStatus: estimated size should be the average non-empty block size
- SPARK-22540: ensure HighlyCompressedMapStatus calculates correct avgSize
- RoaringBitmap: runOptimize succeeded
- RoaringBitmap: runOptimize failed
- Blocks which are bigger than SHUFFLE_ACCURATE_BLOCK_THRESHOLD should not be underestimated.
- SPARK-21133 HighlyCompressedMapStatus#writeExternal throws NPE
Run completed in 7 seconds, 971 milliseconds.
Total number of tests run: 10
Suites: completed 2, aborted 0
Tests: succeeded 10, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

@dongjoon-hyun dongjoon-hyun reopened this Jul 25, 2020
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-32436][CORE] Remove unused HighlyCompressedMapStatus.numNonEmptyBlocks [SPARK-32436][CORE] Initialize numNonEmptyBlocks in HighlyCompressedMapStatus.readExternal Jul 25, 2020

@SparkQA

SparkQA commented Jul 25, 2020

Test build #126535 has finished for PR 29231 at commit 6654013.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Could you review this, @HyukjinKwon ?

@dongjoon-hyun
Member Author

Also, cc @srowen .

@srowen
Member

srowen commented Jul 25, 2020

If the fix works, great. I think this might help me understand another failure in the streaming component.
I'm trying to understand why it fails now. Externalizable should still start by constructing this object with the no-arg constructor, which does init this field. I think the key is that Externalizable wants a public no-arg constructor and this isn't public, and for some reason that is now a problem.

Other possible fixes? Make it public (the class is private anyway).
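
srowen's hypothesis can be checked outside Spark with a small sketch (hypothetical class, not Spark's code): Java's deserializer refuses an `Externalizable` class whose no-arg constructor is not public. Serialization still succeeds, but `readObject` throws `InvalidClassException` ("no valid constructor").

```java
import java.io.*;

// Minimal demonstration of the Externalizable constructor requirement:
// a non-public no-arg constructor makes the class serializable but
// not deserializable.
public class Main {

    public static class Hidden implements Externalizable {
        protected Hidden() {}     // not public: violates the Externalizable contract
        public Hidden(int ignored) {}

        @Override public void writeExternal(ObjectOutput out) throws IOException {}
        @Override public void readExternal(ObjectInput in) throws IOException {}
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Hidden(0)); // writing works fine
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            in.readObject();
            System.out.println("deserialized");
        } catch (InvalidClassException e) {
            // Message is "no valid constructor" on the JDKs I checked.
            System.out.println("InvalidClassException");
        }
    }
}
```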

@dongjoon-hyun
Member Author

Let me try, @srowen .

Other possible fixes? Make it public (the class is private anyway).

@dongjoon-hyun
Member Author

I tried removing all the constraints: 1) the protected no-arg constructor made public, 2) the private primary constructor made public, 3) the class made public. But it turns out that this does not resolve the issue.

-private[spark] class HighlyCompressedMapStatus private (
+class HighlyCompressedMapStatus(
     private[this] var loc: BlockManagerId,
     private[this] var numNonEmptyBlocks: Int,
     private[this] var emptyBlocks: RoaringBitmap,
@@ -181,7 +181,7 @@ private[spark] class HighlyCompressedMapStatus private (
     || numNonEmptyBlocks == 0 || _mapTaskId > 0,
     "Average size can only be zero for map stages that produced no output")

-  protected def this() = this(null, -1, null, -1, null, -1)  // For deserialization only
+  def this() = this(null, -1, null, -1, null, -1)  // For deserialization only

   override def location: BlockManagerId = loc

@@ -217,7 +217,7 @@ private[spark] class HighlyCompressedMapStatus private (

   override def readExternal(in: ObjectInput): Unit = Utils.tryOrIOException {
     loc = BlockManagerId(in)
-    numNonEmptyBlocks = -1 // SPARK-32436 Scala 2.13 doesn't initialize this during deserialization
+    // numNonEmptyBlocks = -1 // SPARK-32436 Scala 2.13 doesn't initialize this during deserialization

@dongjoon-hyun
Member Author

This PR technically doesn't change the logic. In other words, the value will be the same in Scala 2.12 and 2.13. Can we move forward with the AS-IS patch?

BTW, RDDSuite has another failure which I'm working on now.

@dongjoon-hyun
Member Author

I updated the PR description to focus on MapStatusSuite and make it clearer.

@srowen
Member

srowen commented Jul 25, 2020

Hm, weird. I still don't understand why this behavior is different in 2.13. OK, go ahead.

@dongjoon-hyun
Member Author

Thanks! Merged to master.

@HyukjinKwon
Member

LGTM

@dongjoon-hyun
Member Author

Thank you, @HyukjinKwon .

@mridulm
Contributor

mridulm commented Aug 10, 2020

@dongjoon-hyun Circling through older PR's ... do we know why this is happening ?
More than the specifics of this class, I am more concerned for other classes where the initialization might not be happening, and we are not (yet) detecting the issue.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Aug 10, 2020

Hi, @mridulm ! This was a bug in Scala 2.13 and 2.12.12.

I believe they will fix the bug in the next releases. Then we won't need to detect or change anything inside Apache Spark.

@mridulm
Contributor

mridulm commented Aug 11, 2020

Thanks @dongjoon-hyun !

@dossett

dossett commented Oct 16, 2020

@dongjoon-hyun Will this be included in a Spark 3.0.x release or is the plan to wait for a fix on the scala side? I ran into this very issue today, so just wondering. Thank you.

@srowen
Member

srowen commented Oct 16, 2020

This only seems to affect Scala 2.13, and only 3.1.x supports Scala 2.13, so no, there isn't a need to put it in 3.0.x. The workaround here doesn't require a Scala fix, but a Scala fix may also resolve it anyway.

@dossett

dossett commented Oct 16, 2020

Thank you @srowen, the environment I saw this on was running spark 3.0.1 and scala 2.12.12. If I can reproduce it today I can share a stack trace and other details if that would be helpful.

@dossett

dossett commented Oct 16, 2020

Running on GCP's dataproc 2.0:

dossett@dossett-delta-w-0:~$ spark-sql --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.12, OpenJDK 64-Bit Server VM, 1.8.0_265
Branch HEAD
Compiled by user  on 2020-09-17T10:01:54Z
Revision 0aee93de8ef2a90403093b91843de9777b7ab5ef
Url https://bigdataoss-internal.googlesource.com/third_party/apache/spark
Type --help for more information.

I'm playing with Databricks Delta Lake; a simple vacuum command fails with a long stack trace, with this at the bottom:

Caused by: java.lang.NoSuchFieldError: numNonEmptyBlocks
	at org.apache.spark.scheduler.HighlyCompressedMapStatus.<init>(MapStatus.scala:174)
	at org.apache.spark.scheduler.HighlyCompressedMapStatus$.apply(MapStatus.scala:269)
	at org.apache.spark.scheduler.MapStatus$.apply(MapStatus.scala:70)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

If this is helpful I'm happy to provide more information.

@srowen
Member

srowen commented Oct 16, 2020

That is strange: it doesn't seem to happen in Spark unit tests on 2.12, but Spark is built with 2.12.10. From the links above, it seems like it could be an issue in 2.12.12. Therefore, @dongjoon-hyun, it might be useful to backport this just in case? It's a small change, and I think what it does is prevent the compiler from (incorrectly?) eliding the field during compilation. Even when that's fixed, this change doesn't hurt anything.

@dongjoon-hyun
Member Author

I agree with you guys, @srowen and @dossett ! Sure, I'll test and backport this.

@dongjoon-hyun
Member Author

BTW, I linked the related Scala issues here.

dongjoon-hyun added a commit that referenced this pull request Oct 16, 2020
…apStatus.readExternal


Closes #29231 from dongjoon-hyun/SPARK-32436.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit f9f1867)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Member Author

This has landed on branch-3.0 now.

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…apStatus.readExternal


Closes apache#29231 from dongjoon-hyun/SPARK-32436.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit f9f1867)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>