[SPARK-50421][CORE] Fix executor related memory config incorrect when multiple resource profiles worked#48963

Closed
zjuwangg wants to merge 8 commits into apache:master from zjuwangg:m_fixConfig

Conversation

Contributor

@zjuwangg zjuwangg commented Nov 26, 2024

What changes were proposed in this pull request?

Reset the executor's memory-related configs when the resource profile in use is not the default resource profile.

Why are the changes needed?

When multiple resource profiles exist in the same Spark application, the executor's memory-related configs are not overridden by the resource profile's memory sizes, which causes maxOffHeap in UnifiedMemoryManager to be incorrect.
See https://issues.apache.org/jira/browse/SPARK-50421 for more details
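The failure mode can be modeled with a small illustrative sketch (plain Python, not Spark source; the conf keys are real Spark keys, but `apply_profile_memory`, `DEFAULT_PROFILE_ID`, and the resource names are hypothetical simplifications): when an executor is launched for a non-default resource profile, its memory confs must be overridden from that profile's requests, otherwise the memory manager is sized from stale application-level values.

```python
# Illustrative model of the fix (hypothetical helper, not Spark source):
# for a non-default resource profile, override the executor's memory
# confs from the profile's requests so the memory manager sees the
# profile's sizes rather than the application-level defaults.

DEFAULT_PROFILE_ID = 0  # Spark's default resource profile id

def apply_profile_memory(conf, profile_id, requests):
    """conf: dict of Spark conf key -> value; requests: resource -> MiB."""
    if profile_id == DEFAULT_PROFILE_ID:
        return conf  # default profile: keep the submitted confs
    if "memory" in requests:
        conf["spark.executor.memory"] = f"{requests['memory']}m"
    if "memoryOverhead" in requests:
        conf["spark.executor.memoryOverhead"] = f"{requests['memoryOverhead']}m"
    if "offHeap" in requests:
        # Reset unconditionally, so a default-profile value cannot leak
        # into an executor whose profile asks for a different size.
        conf["spark.memory.offHeap.size"] = f"{requests['offHeap']}m"
    return conf

conf = {"spark.executor.memory": "4g", "spark.memory.offHeap.size": "512m"}
apply_profile_memory(conf, profile_id=2, requests={"memory": 8192, "offHeap": 0})
# conf now reflects profile 2: memory "8192m", offHeap.size "0m"
```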

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested in our internal Spark version and jobs.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Nov 26, 2024
@zjuwangg
Contributor Author

@tgravescs Would you help review this commit?

@zjuwangg zjuwangg changed the title [SPARK-50421][wip]fix executor related memory config incorrect when multiple resource profile worked [SPARK-50421]fix executor related memory config incorrect when multiple resource profile worked Nov 26, 2024
case (ResourceProfile.OFFHEAP_MEM, request) =>
  driverConf.set(MEMORY_OFFHEAP_SIZE.key, request.amount.toString + "m")
  if (request.amount > 0) {
    driverConf.set(MEMORY_OFFHEAP_ENABLED.key, "true")
Contributor


this is questionable to me, we haven't supported setting confs yet so I would expect it to pick up the default config for this. I'd rather see a separate issue if we want to revisit this behavior.


Thank you very much for this PR; we just encountered this problem.
Would it be better to reset the offheap configuration this way?

case (ResourceProfile.OFFHEAP_MEM, request) if request.amount > 0 =>
  driverConf.set(MEMORY_OFFHEAP_SIZE.key, request.amount.toString + "m")
  driverConf.set(MEMORY_OFFHEAP_ENABLED.key, "true")

Contributor Author

@zjuwangg zjuwangg Dec 3, 2024


this is questionable to me, we haven't supported setting confs yet so I would expect it to pick up the default config for this. I'd rather see a separate issue if we want to revisit this behavior.

Makes sense to me. I'll address it later. How about introducing a config to control whether offheap is enabled?

Contributor Author

@zjuwangg zjuwangg Dec 3, 2024


Thank you very much for this PR, we just encountered this problem. Is the logic of resetting the offheap configuration better this way?

case (ResourceProfile.OFFHEAP_MEM, request) if request.amount > 0 =>
  driverConf.set(MEMORY_OFFHEAP_SIZE.key, request.amount.toString + "m")
  driverConf.set(MEMORY_OFFHEAP_ENABLED.key, "true")

@xumanbu
I think this is incorrect. Imagine a scenario where the default resource profile's offheap is 512M and the resource profile with id 2 has offheap 0; in that case the offheap config will be wrong!
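The objection can be made concrete with a toy comparison (plain Python with hypothetical helpers, not the PR's Scala): guarding the whole reset behind `request.amount > 0` means a profile that requests zero offheap silently inherits the default profile's size, whereas resetting the size unconditionally and guarding only the enable flag behaves correctly.

```python
# Toy comparison of the two proposed shapes (hypothetical helpers).

def reset_guarded(conf, amount):
    # Suggested shape: do nothing at all when amount == 0
    if amount > 0:
        conf["spark.memory.offHeap.size"] = f"{amount}m"
        conf["spark.memory.offHeap.enabled"] = "true"

def reset_unguarded(conf, amount):
    # PR's shape: always reset the size; only the enable flag is guarded
    conf["spark.memory.offHeap.size"] = f"{amount}m"
    if amount > 0:
        conf["spark.memory.offHeap.enabled"] = "true"

base = {"spark.memory.offHeap.size": "512m",
        "spark.memory.offHeap.enabled": "true"}

guarded = dict(base)
reset_guarded(guarded, amount=0)       # profile requests 0 offheap
print(guarded["spark.memory.offHeap.size"])    # still "512m": stale

unguarded = dict(base)
reset_unguarded(unguarded, amount=0)
print(unguarded["spark.memory.offHeap.size"])  # "0m": correctly reset
```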

Comment thread core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala Outdated
@zjuwangg zjuwangg changed the title [SPARK-50421]fix executor related memory config incorrect when multiple resource profile worked [SPARK-50421]Fix executor related memory config incorrect when multiple resource profile worked Dec 3, 2024
@zjuwangg
Contributor Author

zjuwangg commented Dec 3, 2024

@tgravescs I just updated the commit as the comments. Please help review when you have free time.

@tgravescs
Contributor

What testing have you done with this? It would be nice to have an integration test, but I know that would be difficult.

@zjuwangg
Contributor Author

zjuwangg commented Dec 4, 2024

What testing have you done with this? It would be nice to have an integration test, but I know that would be difficult.

I have tested this in our internal Spark version and UnifiedMemoryManager worked as expected. I tried to add an integration test but found it's hard, since there is no easy way to mock the driver and executor to verify the code.

Contributor

@tgravescs tgravescs left a comment


+1 looks fine to me.

@zjuwangg
Contributor Author

zjuwangg commented Dec 5, 2024

+1 looks fine to me.

Thanks! Would you help merge it?
I think this should also be included in Spark 3.5.4.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-50421]Fix executor related memory config incorrect when multiple resource profile worked [SPARK-50421][CORE] Fix executor related memory config incorrect when multiple resource profiles worked Dec 5, 2024
Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you, @zjuwangg , @tgravescs , @xumanbu .

Merged to master.

cc @LuciferYang as a release manager of Apache Spark 3.5.4

@dongjoon-hyun
Member

Due to the Structured Log change, this patch is not applicable to branch-3.5.

Please make a backporting PR to branch-3.5, @zjuwangg .

@dongjoon-hyun
Member

BTW, I added you to Apache Spark Contributor group in JIRA and assigned SPARK-50421 to you, @zjuwangg .

Welcome to the Apache Spark community and thank you again.

@zjuwangg
Contributor Author

zjuwangg commented Dec 6, 2024

Due to the Structured Log change, this patch is not applicable to branch-3.5.

Please make a backporting PR to branch-3.5, @zjuwangg .

I'll make a backporting PR to branch-3.5 soon.

@zjuwangg
Contributor Author

zjuwangg commented Dec 6, 2024

Due to the Structured Log change, this patch is not applicable to branch-3.5.
Please make a backporting PR to branch-3.5, @zjuwangg .

I'll make a backporting PR to branch-3.5 soon.

Backporting PR: #49090

@zjuwangg zjuwangg deleted the m_fixConfig branch December 6, 2024 08:33
dongjoon-hyun pushed a commit that referenced this pull request Dec 6, 2024
… when multiple resource profiles worked

### What changes were proposed in this pull request?

Reset the executor's memory-related configs when the resource profile in use is not the default resource profile.

### Why are the changes needed?
When multiple resource profiles exist in the same Spark application, the executor's memory-related configs are not overridden by the resource profile's memory sizes, which causes maxOffHeap in `UnifiedMemoryManager` to be incorrect.
See https://issues.apache.org/jira/browse/SPARK-50421 for more details

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested in our internal Spark version and jobs.

### Was this patch authored or co-authored using generative AI tooling?
No

This is a backporting from #48963 to branch 3.5

Closes #49090 from zjuwangg/m35_fixConfig.

Authored-by: Terry Wang <zjuwangg@foxmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    log"${MDC(LogKeys.EXECUTOR_MEMORY_OVERHEAD_SIZE, request)}")
case (ResourceProfile.CORES, request) =>
  driverConf.set(EXECUTOR_CORES.key, request.amount.toString)
  logInfo(log"Set executor cores to ${MDC(LogKeys.NUM_EXECUTOR_CORES, request)}")
Contributor


Don't we need to do this for PYSPARK_MEM as well? (from ResourceProfile.allSupportedExecutorResources)

Contributor Author

@zjuwangg zjuwangg Dec 16, 2024


PYSPARK_MEM, like EXECUTOR_MEMORY_OVERHEAD, doesn't seem to have any actual use during task execution, but I think it's nice to handle it here for consistency.
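The consistency point can be sketched with a toy table-driven version (plain Python; the resource names and conf keys mirror what I believe ResourceProfile's constants are, but they are assumptions here, and `apply_all` is a hypothetical illustration): routing every memory-type resource through one mapping makes it harder to forget an entry such as `pyspark.memory`.

```python
# Assumed mapping from resource-profile resource names to executor conf
# keys; both the names and keys are assumptions for illustration, and
# apply_all is a hypothetical helper rather than Spark source.
MEMORY_CONF_KEYS = {
    "memory": "spark.executor.memory",
    "memoryOverhead": "spark.executor.memoryOverhead",
    "offHeap": "spark.memory.offHeap.size",
    "pyspark.memory": "spark.executor.pyspark.memory",
}

def apply_all(conf, requests):
    """Copy every known memory-type request into the executor conf."""
    for name, amount in requests.items():
        key = MEMORY_CONF_KEYS.get(name)
        if key is not None:
            conf[key] = f"{amount}m"
    return conf

conf = apply_all({}, {"memory": 2048, "pyspark.memory": 512})
# conf: {"spark.executor.memory": "2048m",
#        "spark.executor.pyspark.memory": "512m"}
```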
