[GLUTEN-9344][VL] Document dynamic offheap sizing feature by zhouyuan · Pull Request #9391 · apache/gluten

zhouyuan · 2025-04-22T09:03:19Z

What changes were proposed in this pull request?

This patch adds documentation for the dynamic off-heap sizing feature.
Fixes: #9344

How was this patch tested?

No testing needed; this is a documentation-only change.

Signed-off-by: Yuan Zhou <yuan.zhou@ibm.com>

github-actions · 2025-04-22T09:03:39Z

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Other pull requests

github-actions · 2025-04-22T09:22:22Z

#9344

srinivasst · 2025-05-09T08:06:07Z

+- Spark first attempts to allocate memory based on the on-heap size. Note that the maximum memory size is controlled by spark.executor.memory.
+- When Velox tries to allocate memory, Gluten attempts to allocate from system memory and records this in the memory allocator.
+- If there is sufficient memory, allocations proceed normally.
+- If memory is insufficient, Spark performs garbage collection (GC) to free on-heap memory, allowing Velox to allocate memory.


In `gluten-core/src/main/java/org/apache/gluten/memory/memtarget/DynamicOffHeapSizingMemoryTarget.java`, the borrow code only checks for uncommitted memory (`size + usedOffHeapMemory + totalHeapMemory <= TOTAL_MEMORY_SHARED`).

It does not acquire memory from the `TaskMemoryManager`.

How will this trigger spill / GC?
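
For readers following this thread, here is a minimal sketch of the kind of check the comment describes. It is not the Gluten source: the class `SharedBudgetSketch`, its methods, and the idea of passing the heap size in as a parameter are assumptions made for illustration. Only the inequality `size + usedOffHeapMemory + totalHeapMemory <= TOTAL_MEMORY_SHARED` comes from the comment.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the borrow check quoted in the comment; not the Gluten source.
final class SharedBudgetSketch {
    // Assumed fixed budget shared by the Java heap and off-heap allocations.
    private final long totalMemoryShared;
    // Off-heap bytes granted so far by this sketch.
    private final AtomicLong usedOffHeapMemory = new AtomicLong();

    SharedBudgetSketch(long totalMemoryShared) {
        this.totalMemoryShared = totalMemoryShared;
    }

    // Grants `size` bytes only if off-heap use plus the committed heap stays within the budget.
    // Returns the number of bytes granted: `size` on success, 0 on rejection.
    long borrow(long size, long totalHeapMemory) {
        while (true) {
            long used = usedOffHeapMemory.get();
            if (size + used + totalHeapMemory > totalMemoryShared) {
                return 0L; // over budget: nothing is granted and no other manager is consulted
            }
            if (usedOffHeapMemory.compareAndSet(used, used + size)) {
                return size;
            }
            // Another thread changed the counter; re-read and try again.
        }
    }

    // Returns previously borrowed bytes to the budget.
    long repay(long size) {
        usedOffHeapMemory.addAndGet(-size);
        return size;
    }

    long used() {
        return usedOffHeapMemory.get();
    }

    public static void main(String[] args) {
        long heap = Runtime.getRuntime().totalMemory();
        SharedBudgetSketch budget = new SharedBudgetSketch(Runtime.getRuntime().maxMemory());
        System.out.println("granted: " + budget.borrow(1L << 20, heap));
    }
}
```

In this sketch a rejected borrow simply returns 0. Nothing in it requests a spill or a GC, which is the gap the question above points at.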

FelixYBW · 2025-05-12T05:47:52Z

#9588

github-actions · 2025-06-27T02:11:22Z

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

philo-he

Some minor comments.

philo-he · 2025-08-16T04:13:28Z

@@ -0,0 +1,24 @@
+## Dynamic Off-heap Sizing
+Spark requires setting both on-heap and off-heap memory sizes, which initializes different memory layouts. Improper configuration of these settings can lead to lower performance. Dynamic off-heap sizing is an experimental feature designed to simplify this process. When enabled, off-heap settings are ignored, and Velox uses the on-heap size as the memory size.


Nit: I recommend breaking this into multiple lines, which is more readable in editors that do not wrap text visually. It also improves readability in diffs.
Ditto for the other changes.

philo-he · 2025-08-16T07:10:22Z

+
+## Limitations
+
+This feature is still under heavy development.


Suggestion: ** is still in early development.

Signed-off-by: Yuan <yuanzhou@apache.org>

steveburnett

Just a few suggestions for formatting and phrasing.

steveburnett · 2025-08-19T18:49:07Z

+## Dynamic Off-heap Sizing
+Gluten requires setting both on-heap and off-heap memory sizes, which initializes different memory layouts. Improper configuration of these settings can lead to lower performance. 
+
+To fix this issue, Dynamic off-heap sizing is an experimental feature designed to simplify this process. When enabled, off-heap settings are ignored, and Velox uses the on-heap size as the memory size.


Suggested change

-To fix this issue, Dynamic off-heap sizing is an experimental feature designed to simplify this process. When enabled, off-heap settings are ignored, and Velox uses the on-heap size as the memory size.
+To fix this issue, dynamic off-heap sizing is an experimental feature designed to simplify this process. When enabled, off-heap settings are ignored, and Velox uses the on-heap size as the memory size.

steveburnett · 2025-08-19T18:49:33Z

+
+In general, the feature works as follows:
+
+- Spark first attempts to allocate memory based on the on-heap size. Note that the maximum memory size is controlled by spark.executor.memory.


Suggested change

-- Spark first attempts to allocate memory based on the on-heap size. Note that the maximum memory size is controlled by spark.executor.memory.
+- Spark first attempts to allocate memory based on the on-heap size. Note that the maximum memory size is controlled by `spark.executor.memory`.

steveburnett · 2025-08-19T18:54:58Z

+- If memory is insufficient, Spark performs garbage collection (GC) to free on-heap memory, allowing Velox to allocate memory.
+- If memory remains insufficient after GC, Spark reports an out-of-memory (OOM) issue.
+
+We then enforce a total memory quota, calculated as the sum of committed and in-use memory in the Java heap (using Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) plus tracked off-heap memory in TreeMemoryConsumer. If an allocation exceeds this total committed memory, the allocation fails, triggering an OOM.


Suggested change

-We then enforce a total memory quota, calculated as the sum of committed and in-use memory in the Java heap (using Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) plus tracked off-heap memory in TreeMemoryConsumer. If an allocation exceeds this total committed memory, the allocation fails, triggering an OOM.
+We then enforce a total memory quota, calculated as the sum of committed and in-use memory in the Java heap (using `Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()`) plus tracked off-heap memory in `TreeMemoryConsumer`. If an allocation exceeds this total committed memory, the allocation fails and triggers an OOM.
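
The flow in this hunk (check the quota, run GC if the request does not fit, report an OOM if it still does not fit) can be sketched roughly as below. This is an illustration under assumptions, not Gluten's implementation: `QuotaCheckSketch` and its members are hypothetical names, and the heap reading and the GC trigger are injected so the behavior is easy to follow.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of "check quota, GC, retry, then OOM"; not the Gluten source.
final class QuotaCheckSketch {
    private final long quota;              // assumed total budget for heap plus off-heap
    private final LongSupplier heapBytes;  // e.g. () -> rt.totalMemory() - rt.freeMemory()
    private final Runnable gc;             // e.g. System::gc
    private long offHeapBytes;             // tracked off-heap usage

    QuotaCheckSketch(long quota, LongSupplier heapBytes, Runnable gc) {
        this.quota = quota;
        this.heapBytes = heapBytes;
        this.gc = gc;
    }

    // Records an off-heap allocation of `size` bytes, or throws if it cannot fit even after GC.
    synchronized void allocate(long size) {
        if (!fits(size)) {
            gc.run(); // try to shrink the heap's share before giving up
            if (!fits(size)) {
                throw new OutOfMemoryError("quota exceeded: requested " + size + " bytes");
            }
        }
        offHeapBytes += size;
    }

    synchronized long offHeapBytes() {
        return offHeapBytes;
    }

    private boolean fits(long size) {
        return heapBytes.getAsLong() + offHeapBytes + size <= quota;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        QuotaCheckSketch check = new QuotaCheckSketch(
            rt.maxMemory(), () -> rt.totalMemory() - rt.freeMemory(), System::gc);
        check.allocate(1L << 20);
        System.out.println("tracked off-heap bytes: " + check.offHeapBytes());
    }
}
```

Note that the check runs only when `allocate` is called, so heap growth between calls is not checked against the quota. That matches the oversubscription caveat raised later in the document.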

steveburnett · 2025-08-19T19:01:31Z

+
+We then enforce a total memory quota, calculated as the sum of committed and in-use memory in the Java heap (using Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) plus tracked off-heap memory in TreeMemoryConsumer. If an allocation exceeds this total committed memory, the allocation fails, triggering an OOM.
+
+With this change, the "quota check" is performed when an allocation in the native engine is informed to Gluten. In practice, this means the Java codebase can oversubscribe memory within the on-heap quota, even if off-heap usage is sufficient to fail the allocation.


"is informed to " - can you find another way to describe this? I'm not sure from this phrasing what happens here.
Maybe:
"is sent to Gluten"
"is performed when Gluten receives an allocation request"
or something else you can think of.

zhli1142015

Thanks.

zhli1142015 · 2025-08-21T06:22:47Z

+- If memory is insufficient, Spark performs garbage collection (GC) to free on-heap memory, allowing Velox to allocate memory.
+- If memory remains insufficient after GC, Spark reports an out-of-memory (OOM) issue.
+
+We then enforce a total memory quota, calculated as the sum of committed and in-use memory in the Java heap (using Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) plus tracked off-heap memory in TreeMemoryConsumer. If an allocation exceeds this total committed memory, the allocation fails, triggering an OOM.


I think we use only Runtime.getRuntime().totalMemory() for the calculation here.
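
To make the distinction in this comment concrete, the snippet below reads both quantities from the standard `java.lang.Runtime` API: `totalMemory()` is the heap currently committed to the JVM, while `totalMemory() - freeMemory()` approximates the part of it that is in use. The class name `HeapAccountingSketch` is made up for the example, and the snippet makes no claim about which value Gluten actually uses.

```java
// Illustrative only: compares committed heap with in-use heap via java.lang.Runtime.
final class HeapAccountingSketch {
    // Returns {committedBytes, usedBytes} from one pair of Runtime reads.
    static long[] snapshot(Runtime rt) {
        long committed = rt.totalMemory();   // heap currently reserved by the JVM
        long free = rt.freeMemory();         // unused portion of the committed heap
        return new long[] {committed, committed - free};
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long[] s = snapshot(rt);
        System.out.println("committed heap (totalMemory):         " + s[0]);
        System.out.println("used heap (totalMemory - freeMemory): " + s[1]);
        System.out.println("max heap (maxMemory):                 " + rt.maxMemory());
    }
}
```

A quota built on the committed value is the more conservative of the two, since it also counts heap that the JVM has reserved but is not currently using.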

zhli1142015

Thanks.

Signed-off-by: Yuan <yuanzhou@apache.org>

[VL] Adding doc for dynamic offheap sizing feature

ed0538c

Signed-off-by: Yuan Zhou <yuan.zhou@ibm.com>

github-actions bot added the DOCS label Apr 22, 2025

zhouyuan changed the title ~~[VL] Adding doc for dynamic offheap sizing feature~~ [GLUTEN-9344][VL] Adding doc for dynamic offheap sizing feature Apr 22, 2025

srinivasst reviewed May 9, 2025

github-actions bot added the stale label Jun 27, 2025

zhouyuan removed the stale label Jun 27, 2025

zhouyuan marked this pull request as ready for review July 10, 2025 12:33

zhouyuan requested a review from zhli1142015 July 10, 2025 12:33

philo-he approved these changes Aug 16, 2025

zhouyuan added 3 commits August 17, 2025 12:53

refine

1606cfb

Signed-off-by: Yuan <yuanzhou@apache.org>

adding nav order

25c2174

Signed-off-by: Yuan <yuanzhou@apache.org>

refine

630e0ba

Signed-off-by: Yuan <yuanzhou@apache.org>

steveburnett reviewed Aug 19, 2025

zhli1142015 reviewed Aug 21, 2025

philo-he changed the title ~~[GLUTEN-9344][VL] Adding doc for dynamic offheap sizing feature~~ [GLUTEN-9344][VL] Document dynamic offheap sizing feature Aug 21, 2025

address comments

4060fa4

Signed-off-by: Yuan <yuanzhou@apache.org>

zhouyuan merged commit f643e06 into apache:main Aug 23, 2025
3 checks passed
