
[GLUTEN-9344][VL] Document dynamic offheap sizing feature#9391

Merged
zhouyuan merged 5 commits into apache:main from zhouyuan:wip_dynamic_offheap_doc
Aug 23, 2025

Conversation

@zhouyuan
Member

@zhouyuan zhouyuan commented Apr 22, 2025

What changes were proposed in this pull request?

This patch adds documentation for the dynamic off-heap sizing feature.
Fixes: #9344

How was this patch tested?

No testing needed (documentation-only change).

Signed-off-by: Yuan Zhou <yuan.zhou@ibm.com>
@github-actions github-actions bot added the DOCS label Apr 22, 2025
@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

@zhouyuan zhouyuan changed the title [VL] Adding doc for dynamic offheap sizing feature [GLUTEN-9344][VL] Adding doc for dynamic offheap sizing feature Apr 22, 2025
@github-actions

#9344

- Spark first attempts to allocate memory based on the on-heap size. Note that the maximum memory size is controlled by spark.executor.memory.
- When Velox tries to allocate memory, Gluten attempts to allocate from system memory and records this in the memory allocator.
- If there is sufficient memory, allocations proceed normally.
- If memory is insufficient, Spark performs garbage collection (GC) to free on-heap memory, allowing Velox to allocate memory.
Contributor

In `gluten-core/src/main/java/org/apache/gluten/memory/memtarget/DynamicOffHeapSizingMemoryTarget.java`, the borrow code only checks for uncommitted memory (`size + usedOffHeapMemory + totalHeapMemory <= TOTAL_MEMORY_SHARED`).

It does not acquire memory from the `TaskMemoryManager`.

How will this trigger spill / GC?
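
For readers following this question, the check quoted above can be sketched roughly as follows. This is a hypothetical illustration, not the code in `DynamicOffHeapSizingMemoryTarget`: the class name `SharedQuotaSketch`, the `repay` and `usedOffHeapBytes` helpers, and the explicit `totalHeapMemory` parameter are assumptions made to keep the sketch self-contained and deterministic. A real caller would presumably pass a value read from the JVM.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a shared heap/off-heap budget check; not the Gluten source. */
final class SharedQuotaSketch {
  private final long totalMemoryShared; // assumed fixed budget, in bytes
  private final AtomicLong usedOffHeap = new AtomicLong();

  SharedQuotaSketch(long totalMemoryShared) {
    this.totalMemoryShared = totalMemoryShared;
  }

  /**
   * Grants {@code size} bytes only if the request, the tracked off-heap usage,
   * and the heap figure supplied by the caller together fit within the budget.
   * Returns the granted size, or 0 when the request is rejected.
   */
  long borrow(long size, long totalHeapMemory) {
    while (true) {
      long used = usedOffHeap.get();
      if (size + used + totalHeapMemory > totalMemoryShared) {
        // Rejected. Note that nothing on this path asks Spark to spill
        // or asks the JVM to collect garbage.
        return 0L;
      }
      if (usedOffHeap.compareAndSet(used, used + size)) {
        return size;
      }
    }
  }

  /** Returns previously borrowed bytes to the budget. */
  void repay(long size) {
    usedOffHeap.addAndGet(-size);
  }

  long usedOffHeapBytes() {
    return usedOffHeap.get();
  }
}
```

In a sketch like this, a rejected `borrow` simply returns 0, which is the gap the question points at: any spill or GC step would have to be triggered by the caller.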

@FelixYBW
Contributor

#9588

@github-actions

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale label Jun 27, 2025
@zhouyuan zhouyuan removed the stale label Jun 27, 2025
@zhouyuan zhouyuan marked this pull request as ready for review July 10, 2025 12:33
@zhouyuan zhouyuan requested a review from zhli1142015 July 10, 2025 12:33
Member

@philo-he philo-he left a comment

Some minor comments.

@@ -0,0 +1,24 @@
## Dynamic Off-heap Sizing
Spark requires setting both on-heap and off-heap memory sizes, which initializes different memory layouts. Improper configuration of these settings can lead to lower performance. Dynamic off-heap sizing is an experimental feature designed to simplify this process. When enabled, off-heap settings are ignored, and Velox uses the on-heap size as the memory size.
Member

Nit: I recommend breaking this into multiple lines, which is more readable in editors that do not wrap text visually and also improves readability in diffs.
Ditto for the other changes.

Member Author

done


## Limitations

This feature is still under heavy development.
No newline at end of file
Member

Suggestion: ** is still in early development.

Member Author

refined

Signed-off-by: Yuan <yuanzhou@apache.org>
Signed-off-by: Yuan <yuanzhou@apache.org>
Signed-off-by: Yuan <yuanzhou@apache.org>

@steveburnett steveburnett left a comment

Just a few suggestions of formatting and phrasing.

## Dynamic Off-heap Sizing
Gluten requires setting both on-heap and off-heap memory sizes, which initializes different memory layouts. Improper configuration of these settings can lead to lower performance.

To fix this issue, Dynamic off-heap sizing is an experimental feature designed to simplify this process. When enabled, off-heap settings are ignored, and Velox uses the on-heap size as the memory size.

Suggested change
To fix this issue, Dynamic off-heap sizing is an experimental feature designed to simplify this process. When enabled, off-heap settings are ignored, and Velox uses the on-heap size as the memory size.
To fix this issue, dynamic off-heap sizing is an experimental feature designed to simplify this process. When enabled, off-heap settings are ignored, and Velox uses the on-heap size as the memory size.


In general, the feature works as follows:

- Spark first attempts to allocate memory based on the on-heap size. Note that the maximum memory size is controlled by spark.executor.memory.

Suggested change
- Spark first attempts to allocate memory based on the on-heap size. Note that the maximum memory size is controlled by spark.executor.memory.
- Spark first attempts to allocate memory based on the on-heap size. Note that the maximum memory size is controlled by `spark.executor.memory`.

- If memory is insufficient, Spark performs garbage collection (GC) to free on-heap memory, allowing Velox to allocate memory.
- If memory remains insufficient after GC, Spark reports an out-of-memory (OOM) issue.
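
The try, collect, retry sequence described in this list can be sketched as below. This is only an illustration under stated assumptions: `GcRetrySketch`, `allocateOrFail`, and the `tryReserve` and `reclaim` callbacks are hypothetical names, and the thread above questions whether the actual borrow path triggers GC at all. The reclaim step is passed in as a `Runnable` because `System.gc()` is only a hint to the JVM and cannot be relied on to free a given amount.

```java
import java.util.function.LongPredicate;

/** Hypothetical sketch of "allocate, reclaim once, retry, then fail"; not the Gluten source. */
final class GcRetrySketch {
  private GcRetrySketch() {}

  /**
   * Tries to reserve {@code size} bytes. On failure it runs the reclaim step
   * once and retries; if the retry also fails, it reports an OOM.
   */
  static void allocateOrFail(long size, LongPredicate tryReserve, Runnable reclaim) {
    if (tryReserve.test(size)) {
      return; // enough memory: the allocation proceeds normally
    }
    reclaim.run(); // for example System::gc, which the JVM may ignore
    if (!tryReserve.test(size)) {
      throw new OutOfMemoryError(
          "Cannot reserve " + size + " bytes after reclaiming heap memory");
    }
  }
}
```

A caller could pass `System::gc` as the reclaim step and a budget check as `tryReserve`; whether a single GC pass is enough is exactly what the documented flow leaves open.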

We then enforce a total memory quota, calculated as the sum of committed and in-use memory in the Java heap (using Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) plus tracked off-heap memory in TreeMemoryConsumer. If an allocation exceeds this total committed memory, the allocation fails, triggering an OOM.

Suggested change
We then enforce a total memory quota, calculated as the sum of committed and in-use memory in the Java heap (using Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) plus tracked off-heap memory in TreeMemoryConsumer. If an allocation exceeds this total committed memory, the allocation fails, triggering an OOM.
We then enforce a total memory quota, calculated as the sum of committed and in-use memory in the Java heap (using `Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()`) plus tracked off-heap memory in `TreeMemoryConsumer`. If an allocation exceeds this total committed memory, the allocation fails and triggers an OOM.


We then enforce a total memory quota, calculated as the sum of committed and in-use memory in the Java heap (using Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) plus tracked off-heap memory in TreeMemoryConsumer. If an allocation exceeds this total committed memory, the allocation fails, triggering an OOM.

With this change, the "quota check" is performed when an allocation in the native engine is informed to Gluten. In practice, this means the Java codebase can oversubscribe memory within the on-heap quota, even if off-heap usage is sufficient to fail the allocation.

"is informed to " - can you find another way to describe this? I'm not sure from this phrasing what happens here.
Maybe:
"is sent to Gluten"
"is performed when Gluten receives an allocation request"
or something else you can think of.

Contributor

@zhli1142015 zhli1142015 left a comment

Thanks.

- If memory is insufficient, Spark performs garbage collection (GC) to free on-heap memory, allowing Velox to allocate memory.
- If memory remains insufficient after GC, Spark reports an out-of-memory (OOM) issue.

We then enforce a total memory quota, calculated as the sum of committed and in-use memory in the Java heap (using Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) plus tracked off-heap memory in TreeMemoryConsumer. If an allocation exceeds this total committed memory, the allocation fails, triggering an OOM.
Contributor

I think we use only Runtime.getRuntime().totalMemory() for the calculation here.
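
The distinction raised here matters for how strict the quota is. `Runtime.totalMemory()` reports the heap currently committed to the JVM, while `totalMemory() - freeMemory()` approximates the part of it occupied by objects, so the first is the more conservative input. A minimal sketch of both options, where `HeapQuotaSketch`, `usedHeap`, `committedHeap`, and `fits` are hypothetical names and not part of Gluten:

```java
/** Hypothetical sketch contrasting two heap measures for a shared quota; not the Gluten source. */
final class HeapQuotaSketch {
  private HeapQuotaSketch() {}

  /** Heap bytes occupied by objects: committed heap minus its free part. */
  static long usedHeap(Runtime rt) {
    return rt.totalMemory() - rt.freeMemory();
  }

  /** Heap bytes currently committed to the JVM, whether occupied or not. */
  static long committedHeap(Runtime rt) {
    return rt.totalMemory();
  }

  /** True if the request plus the heap figure plus tracked off-heap bytes fits the quota. */
  static boolean fits(long request, long heapBytes, long trackedOffHeap, long quota) {
    return request + heapBytes + trackedOffHeap <= quota;
  }
}
```

For example, `fits(size, HeapQuotaSketch.committedHeap(Runtime.getRuntime()), offHeapBytes, quota)` would reject some requests that the `usedHeap` variant accepts, because free but committed heap still counts against the quota.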

Contributor

@zhli1142015 zhli1142015 left a comment

Thanks.

@philo-he philo-he changed the title [GLUTEN-9344][VL] Adding doc for dynamic offheap sizing feature [GLUTEN-9344][VL] Document dynamic offheap sizing feature Aug 21, 2025
Signed-off-by: Yuan <yuanzhou@apache.org>
@zhouyuan zhouyuan merged commit f643e06 into apache:main Aug 23, 2025
3 checks passed

Development

Successfully merging this pull request may close these issues.

[VL] adding documentation of dynamic offheap sizing feature

6 participants