Conversation

@dongjoon-hyun (Member) commented Nov 24, 2020

What changes were proposed in this pull request?

This PR aims to support storage migration to a fallback storage, such as cloud storage (S3), during worker decommissioning, for corner cases where exceptions occur or no live peer is left.

Although this PR focuses on cloud storage like S3, whose TTL feature simplifies Spark's clean-up logic, alternative fallback storages such as HDFS or NFS (EFS) can be used if the user provides a clean-up mechanism.
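
For context, a minimal sketch of enabling the fallback path. The spark.storage.decommission.fallbackStorage.path key is assumed from the config.STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH constant discussed in the review below, and the bucket/path is hypothetical:

    import org.apache.spark.SparkConf

    // Sketch: point the fallback at an S3 location whose bucket has a TTL
    // (lifecycle) rule, so Spark itself does not need clean-up logic.
    val conf = new SparkConf()
      .set("spark.storage.decommission.enabled", "true")
      .set("spark.storage.decommission.fallbackStorage.path",
        "s3a://my-bucket/spark-fallback/")  // hypothetical bucket/path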

Why are the changes needed?

Currently, storage migration is not possible when there is no available executor. For example, when there is only one executor, it cannot perform storage migration because it has no peer.

Does this PR introduce any user-facing change?

Yes. This is a new feature.

How was this patch tested?

Pass the CIs with newly added test cases.

@dongjoon-hyun (Member, Author)

Could you review this, @holdenk, @viirya, @mridulm?


@SparkQA commented Nov 25, 2020

Test build #131768 has finished for PR 30492 at commit 25740c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Tagar commented Nov 25, 2020

@dongjoon-hyun is this only for shuffle data? I was wondering if it would also be possible to cover MEMORY_AND_DISK for cached DataFrames? Thanks!

@dongjoon-hyun (Member, Author)

Thank you for the review, @Tagar. Yes, they are separate options:

  • spark.storage.decommission.shuffleBlocks.enabled
  • spark.storage.decommission.rddBlocks.enabled
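
For illustration, a minimal sketch of how these two independent flags could be set (the SparkConf usage here is illustrative, not taken from this PR):

    import org.apache.spark.SparkConf

    // The two options are independent: one covers shuffle files, the other
    // covers cached RDD/DataFrame blocks (e.g. MEMORY_AND_DISK).
    val conf = new SparkConf()
      .set("spark.storage.decommission.enabled", "true")
      .set("spark.storage.decommission.shuffleBlocks.enabled", "true")
      .set("spark.storage.decommission.rddBlocks.enabled", "true")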

This PR is still under review. If this is accepted, I believe we can move on to that.

@dongjoon-hyun (Member, Author) commented Nov 25, 2020

Could you review this PR once more, please, @viirya and @mridulm?

@holdenk (Contributor) commented Nov 25, 2020

Hi @dongjoon-hyun, I'm taking this week away from open source to take my puppy to go see snow for the first time. I'll do a review on Monday. Thanks for understanding :)

@dongjoon-hyun (Member, Author)

Thank you, @holdenk. Sorry for pinging you during the holiday season.

@dongjoon-hyun (Member, Author)

BTW, cc @dbtsai, too.

@dongjoon-hyun (Member, Author)

Thank you so much for your review, @viirya!

@dongjoon-hyun (Member, Author)

Hi, @zsxwing and @viirya. I addressed your comments. Could you review once more if you have a chance?

@zsxwing (Member) commented Nov 30, 2020

Thanks for adding the application id. I don't have time to look at the details right now, so I'll defer to @viirya.

@SparkQA commented Nov 30, 2020

Test build #131939 has finished for PR 30492 at commit a0285dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, Author)

Got it, @zsxwing.

@dongjoon-hyun (Member, Author)

Hi, @holdenk, @viirya, and @dbtsai. Could you review this, please?

@holdenk (Contributor) left a comment

Thanks for waiting on the review. I've got a few minor nits/questions, but overall this looks good to me. I'm excited for us to support this as part of dynamic scale-down in Spark 3.1 :)

try {
  shuffleManager.shuffleBlockResolver.getBlockData(blockId)
} catch {
  case e: IOException =>
@holdenk (Contributor)

Nit/question: could we move the if up as part of the case and avoid the need for an explicit rethrow?

@dongjoon-hyun (Member, Author)

It's because we should try the normal access path by default. We can use the fallback path only if the normal access path fails, because conf.get(config.STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH).isDefined doesn't mean the fallback storage actually has the data.

@dongjoon-hyun (Member, Author)

Please let me know if I misunderstood your comment.

@holdenk (Contributor)

Yeah, so I mean we still have a try/catch; just the case statement has the if as part of it (e.g., using if expressions (guards) in case statements).
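
For illustration, a sketch of the guard-style alternative being discussed, reconstructed from the snippet and the config constant quoted above (FallbackStorage.read is assumed here as the fallback read entry point; it is not confirmed by this thread):

    // Sketch only: the guard moves the config check into the case pattern, so an
    // IOException that doesn't match the guard propagates with no explicit rethrow.
    try {
      shuffleManager.shuffleBlockResolver.getBlockData(blockId)
    } catch {
      case _: IOException
          if conf.get(config.STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH).isDefined =>
        // The normal path has already failed at this point, so the
        // normal-path-first behavior described above is preserved.
        FallbackStorage.read(conf, blockId)
    }

Either form only consults the fallback after getBlockData throws; the guard version simply folds the config check and the rethrow into the pattern match.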

@holdenk (Contributor) left a comment

LGTM

@viirya (Member) left a comment

Looks okay to me.

@dongjoon-hyun (Member, Author)

Thank you so much, @holdenk and @viirya.
Merged to master for Apache Spark 3.1.0.
