[SPARK-48628][CORE] Add task peak on/off heap memory metrics by liuzqt · Pull Request #47776 · apache/spark

liuzqt · 2024-08-15T16:53:12Z

What changes were proposed in this pull request?

This PR is trying to revive #47192, which was reverted due to regression in ExternalAppendOnlyUnsafeRowArrayBenchmark.

Root cause
We eventually decided to aggregate peak memory usage from all consumers on each acquireExecutionMemory invocation. (see this discussion), which is O(n) complexity where n is the number of consumers.

ExternalAppendOnlyUnsafeRowArrayBenchmark is implemented in a way that all iterations are run in a single task context, therefore the number of consumers is exploding.

Notice that TaskMemoryManager.consumers is never cleaned up the whole lifecycle, and TaskMemoryManager.acquireExecutionMemory is a very frequent operation, doing a linear complexity(in terms of number of consumers) operation here might not be a good choice. This benchmark might be a corner case, but it's still possible to have a large number of consumers in a large query plan.

I fallback to the previous implementation: maintain current execution memory with an extra lock. cc @Ngone51

Benchmark result

ExternalAppendOnlyUnsafeRowArrayBenchmark-results
ExternalAppendOnlyUnsafeRowArrayBenchmark-jdk21-results

Why are the changes needed?

Does this PR introduce any user-facing change?

NO

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

NO

### What changes were proposed in this pull request? Add task on/off heap execution memory in `TaskMetrics`, tracked in `TaskMemoryManager`, **assuming `acquireExecutionMemory` is the only one narrow waist for acquiring execution memory.** ### Why are the changes needed? Currently there is no task on/off heap execution memory metrics. There is a [peakExecutionMemory](https://github.com/apache/spark/blob/3cd35f8cb6462051c621cf49de54b9c5692aae1d/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala#L114) metrics, however, the semantic is a confusing: it only cover the execution memory used by shuffle/join/aggregate/sort, which is accumulated in specific operators and thus not really reflect the real execution memory. Therefore it's necessary to add these two metrics. Also I created two followup sub tickets: - https://issues.apache.org/jira/browse/SPARK-48788 : accumulate task metrics in stage, and display in Spark UI - https://issues.apache.org/jira/browse/SPARK-48789 : deprecate `peakExecutionMemory` once we have replacement for it. The ultimate goal is to have these two metrics ready (as accumulated stage metrics in Spark UI as well) and deprecate `peakExecutionMemory`. ### Does this PR introduce _any_ user-facing change? Supposedly no. But two followup sub tickets will have user-facing change: new metrics exposed to Spark UI, and old metrics deprecation. ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? NO Closes apache#47192 from liuzqt/SPARK-48628. Authored-by: Ziqi Liu <ziqi.liu@databricks.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>

dongjoon-hyun

Thank you, @liuzqt .

Please run ExternalAppendOnlyUnsafeRowArrayBenchmark and attach the result into this PR.

You can run the benchmark in your repository on your branch.

https://github.com/liuzqt/spark/actions/workflows/benchmark.yml

For the detail, please see Running benchmarks in your forked repository section of the Apache Spark developer-tools guideline.

https://spark.apache.org/developer-tools.html

liuzqt · 2024-08-16T04:04:30Z

@dongjoon-hyun updated the benchmark result, didn't see regression.

dongjoon-hyun · 2024-08-16T04:08:16Z

Oh, you need to make it as a part of this commit instead of the PR description. Please commit the result for both Java 17 and Java 21 in the following files, @liuzqt .

liuzqt · 2024-08-16T06:07:16Z

Oh, you need to make it as a part of this commit instead of the PR description. Please commit the result for both Java 17 and Java 21 in the following files, @liuzqt .

https://github.com/apache/spark/blob/master/sql/core/benchmarks/ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt

https://github.com/apache/spark/blob/master/sql/core/benchmarks/ExternalAppendOnlyUnsafeRowArrayBenchmark-jdk21-results.txt

Updated the benchmark result in the PR

dongjoon-hyun · 2024-08-16T07:15:14Z

Thank you for attaching the result. The result looks good to me too, @liuzqt .

dongjoon-hyun

+1, LGTM from my side.

dongjoon-hyun · 2024-08-16T16:21:28Z

cc @cloud-fan , @Ngone51 , @JoshRosen from #47192 , too

liuzqt · 2024-08-21T09:34:52Z

Hi @dongjoon-hyun @mridulm @Ngone51 could you pls help merge this PR? thanks!

JoshRosen

LGTM as well: I think that this is a performance-sensitive enough path that it makes sense to incrementally maintain a separate count vs. repeatedly re-aggregating it, even if this comes at the price of some additional code. This code is comparatively low churn so I'm not especially worried about the risks of it going out of sync in refactorings.

JoshRosen · 2024-08-21T15:42:39Z

I've merged this to master (4.0.0). Thanks @liuzqt!

dongjoon-hyun · 2024-08-21T17:58:29Z

Thank you, @liuzqt and all!

liuzqt added 2 commits August 16, 2024 00:30

fix

4afd26e

github-actions Bot added the CORE label Aug 15, 2024

dongjoon-hyun requested changes Aug 15, 2024

View reviewed changes

liuzqt requested a review from dongjoon-hyun August 16, 2024 04:04

mridulm reviewed Aug 16, 2024

View reviewed changes

Comment thread core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java

update benchmark result

0c10aaa

github-actions Bot added the SQL label Aug 16, 2024

dongjoon-hyun approved these changes Aug 16, 2024

View reviewed changes

liuzqt requested a review from mridulm August 16, 2024 14:18

mridulm approved these changes Aug 17, 2024

View reviewed changes

Ngone51 approved these changes Aug 21, 2024

View reviewed changes

JoshRosen approved these changes Aug 21, 2024

View reviewed changes

JoshRosen closed this in 9b9a7a7 Aug 21, 2024

liuzqt mentioned this pull request Aug 26, 2024

[SPARK-48788][CORE][UI] Expose task peak onheap/offheap execution memory to API and Spark UI #47872

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-48628][CORE] Add task peak on/off heap memory metrics#47776

[SPARK-48628][CORE] Add task peak on/off heap memory metrics#47776
liuzqt wants to merge 3 commits intoapache:masterfrom
liuzqt:SPARK-48628

liuzqt commented Aug 15, 2024 •

edited

Loading

Uh oh!

dongjoon-hyun left a comment •

edited

Loading

Uh oh!

liuzqt commented Aug 16, 2024

Uh oh!

dongjoon-hyun commented Aug 16, 2024

Uh oh!

Uh oh!

liuzqt commented Aug 16, 2024

Uh oh!

dongjoon-hyun commented Aug 16, 2024

Uh oh!

dongjoon-hyun left a comment

Uh oh!

dongjoon-hyun commented Aug 16, 2024

Uh oh!

liuzqt commented Aug 21, 2024

Uh oh!

JoshRosen left a comment

Uh oh!

JoshRosen commented Aug 21, 2024

Uh oh!

dongjoon-hyun commented Aug 21, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

liuzqt commented Aug 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Benchmark result

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

dongjoon-hyun left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

liuzqt commented Aug 16, 2024

Uh oh!

dongjoon-hyun commented Aug 16, 2024

Uh oh!

Uh oh!

liuzqt commented Aug 16, 2024

Uh oh!

dongjoon-hyun commented Aug 16, 2024

Uh oh!

dongjoon-hyun left a comment

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun commented Aug 16, 2024

Uh oh!

liuzqt commented Aug 21, 2024

Uh oh!

JoshRosen left a comment

Choose a reason for hiding this comment

Uh oh!

JoshRosen commented Aug 21, 2024

Uh oh!

dongjoon-hyun commented Aug 21, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

liuzqt commented Aug 15, 2024 •

edited

Loading

dongjoon-hyun left a comment •

edited

Loading