[VL] Failing allocation in DynamicOffHeapSizingMemoryTarget #7605

@wenwj0

Description

Backend

VL (Velox)

Bug description

I am trying to set spark.gluten.memory.dynamic.offHeap.sizing.enabled=true, but an OOM exception occurs.
Spark configuration:

spark.executor.memory=4g
spark.executor.memoryOverhead=1g
spark.gluten.memory.dynamic.offHeap.sizing.enabled=true
spark.memory.offHeap.enabled=true

and the Web UI shows:

spark.gluten.memory.conservative.task.offHeap.size.in.bytes=597059174
spark.gluten.memory.offHeap.size.in.bytes=2388236697
spark.gluten.memory.task.offHeap.size.in.bytes=597059174
spark.memory.offHeap.size=2388236697
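For what it's worth, these Web UI values can be reproduced arithmetically if one assumes Spark's fixed 300 MiB reserved system memory, 4 task slots per executor, and a dynamic sizing fraction of 0.6 (the task-slot count and fraction are assumptions, not confirmed by the logs):

```python
# Hedged sketch: reproducing the Web UI numbers above under the stated
# assumptions (300 MiB reserved memory, 4 task slots, 0.6 sizing fraction).
MIB = 1024 * 1024

executor_memory = 4 * 1024 * MIB   # spark.executor.memory=4g
reserved = 300 * MIB               # Spark's reserved system memory (assumed)
fraction = 0.6                     # assumed dynamic sizing fraction
task_slots = 4                     # assumed executor cores

off_heap = int((executor_memory - reserved) * fraction)
task_off_heap = off_heap // task_slots

print(off_heap)       # 2388236697 -> matches spark.memory.offHeap.size
print(task_off_heap)  # 597059174  -> matches ...task.offHeap.size.in.bytes
```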

And I got the OOM exception:

Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID: 0]: Error during calling Java code from native code: org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget$OutOfMemoryException: Not enough spark off-heap execution memory. Acquired: 8.0 MiB, granted: 0.0 B. Try tweaking config option spark.memory.offHeap.size to get larger space to run this application (if spark.gluten.memory.dynamic.offHeap.sizing.enabled is not enabled). 
Current config settings: 
    spark.gluten.memory.offHeap.size.in.bytes=3.4 GiB
    spark.gluten.memory.task.offHeap.size.in.bytes=876.6 MiB
    spark.gluten.memory.conservative.task.offHeap.size.in.bytes=876.6 MiB
    spark.memory.offHeap.enabled=true
    spark.gluten.memory.dynamic.offHeap.sizing.enabled=true
Memory consumer stats: 
    Task.52:                                             Current used bytes: 104.0 MiB, peak bytes:        N/A
    \- Gluten.Tree.0:                                    Current used bytes: 104.0 MiB, peak bytes:  112.0 MiB
       \- root.0:                                        Current used bytes: 104.0 MiB, peak bytes:  112.0 MiB
          +- CelebornShuffleWriter.0:                    Current used bytes:  48.0 MiB, peak bytes:   48.0 MiB
          |  \- single:                                  Current used bytes:  48.0 MiB, peak bytes:   48.0 MiB
          |     +- gluten::MemoryAllocator:              Current used bytes:  28.8 MiB, peak bytes:   29.0 MiB
          |     \- root:                                 Current used bytes:   4.2 MiB, peak bytes:   15.0 MiB
          |        \- default_leaf:                      Current used bytes:   4.2 MiB, peak bytes:   14.1 MiB

It may be caused by:

24/10/18 17:16:00 WARN org.apache.gluten.memory.memtarget.DynamicOffHeapSizingMemoryTarget: "Failing allocation as unified memory is OOM. Used Off-heap: 406847480, Used On-Heap: 2021017784, Free On-heap: 1796847432, Total On-heap: 3817865216, Max On-heap: 2388236697, Allocation: 8388608."
24/10/18 17:16:00 INFO org.apache.spark.memory.TaskMemoryManager: "Memory used in task 11"
24/10/18 17:16:00 INFO org.apache.spark.memory.TaskMemoryManager: "Acquired by org.apache.gluten.memory.memtarget.spark.TreeMemoryConsumer@182a8cbe: 104.0 MiB"
24/10/18 17:16:00 INFO org.apache.spark.memory.TaskMemoryManager: "0 bytes of memory were used by task 11 but are not associated with specific consumers"
24/10/18 17:16:00 INFO org.apache.spark.memory.TaskMemoryManager: "406847480 bytes of memory are used for execution and 1129714 bytes of memory are used for storage"
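Plugging the numbers from the WARN line into the unified-memory check it implies (names below are illustrative, not the actual DynamicOffHeapSizingMemoryTarget code) shows why the allocation fails even though off-heap usage alone is small:

```python
# Hedged sketch of the check implied by the WARN line above: the allocation
# is rejected when used off-heap + used on-heap + requested bytes exceeds
# the limit reported as "Max On-heap" (2388236697 bytes).
used_off_heap = 406847480    # "Used Off-heap" from the log
used_on_heap = 2021017784    # "Used On-Heap" from the log
allocation = 8388608         # requested 8 MiB
max_memory = 2388236697      # "Max On-heap" from the log

combined = used_off_heap + used_on_heap + allocation
print(combined)               # 2436253872
print(combined > max_memory)  # True -> allocation fails: on-heap usage is
                              # counted against the same unified budget
```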

As shown above, the used off-heap memory is only 406847480 bytes (388 MiB), while my off-heap configuration is 2.2 GiB.
Why is the OOM exception thrown?

Spark version

Spark-3.2.x

Spark configurations

No response

System information

No response

Relevant logs

No response

Labels

bug (Something isn't working), triage
