Backend
VL (Velox)
Bug description
Distinct aggregation merges all sorted spill files in getOutput() (SpillPartition::createOrderedReader). If there are too many spill files, reading the first batch of each file into memory consumes a significant amount of memory. In one of our internal cases, one task generated 300 spill files, which required close to 3 GB of memory.

Possible workarounds:

- kMaxSpillRunRows: the default of 1M will generate too many spill files for hundreds of millions of rows of input ([GLUTEN-7249][VL] Lower default overhead memory ratio and spill run size #7531).
- Set kSpillWriteBufferSize to 1M or lower. Why is it set to 4M by default? Is there any experience with performance tuning here?

Spark version
None
Spark configurations
No response
System information
No response
Relevant logs
No response
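For reference, a back-of-the-envelope sketch of the memory estimate in the bug description above. The merge reader keeps one decoded batch per spill file in memory at once, so peak memory scales with the file count. The per-batch size below is a hypothetical assumption chosen to illustrate the observed numbers, not a value measured from Velox.

```python
# Rough model of memory needed when all spill files are merged at once:
# the ordered reader holds the first batch of every file simultaneously.

def merge_read_memory(num_spill_files: int, first_batch_bytes: int) -> int:
    """Estimate peak memory (bytes) for a k-way merge over spill files."""
    return num_spill_files * first_batch_bytes

# Internal case from the description: ~300 spill files.
# Assuming each file's first batch decodes to ~10 MB (hypothetical):
est = merge_read_memory(300, 10 * 1024 * 1024)
print(f"{est / (1024 ** 3):.1f} GB")  # ~2.9 GB, close to the observed 3 GB
```

The point of the sketch is that memory grows linearly with the number of spill files, which is why tuning kMaxSpillRunRows (fewer, larger files) or kSpillWriteBufferSize directly affects the peak.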