[GLUTEN-5320][VL] Reduce driver memory footprint by postponing the creation and serialization of LocalFilesNode #5321
Run Gluten Clickhouse CI
There are still some cases to fix, for example:
import java.lang.{Long => JLong}
import java.nio.charset.StandardCharsets
import java.time.ZoneOffset
import java.util
Don't introduce this package; just use JArrayList.
public List<String> preferredLocations() {
  return Arrays.asList(filePartition.preferredLocations());
}
val preferredLocations = SoftAffinity.getFilePartitionLocations(f)
Please keep the original logic.
numOutputBatches: SQLMetric,
scanTime: SQLMetric): RDD[ColumnarBatch]

def toLocalFilesNodeByteArray(p: GlutenRawPartition): Array[Array[Byte]]
Could we add a new SplitInfo object file and move this method into it as toSplitInfoByteArray? Then other backends could use it more easily, and we would avoid adding this method to IteratorApi, where it seems unrelated.
Thank you for the improvements. This idea works for me; just a few comments.
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This PR was auto-closed because it has been stalled for 10 days with no activity. Please feel free to reopen if it is still valid. Thanks.
@WangGuangxin are you still working on this PR?
@Yohahaha I'll rework this PR this week.
Hi @WangGuangxin, feel free to ask for my PR to be closed if yours is ready for review.
What changes were proposed in this pull request?
Currently, the driver generates GlutenPartition objects from Spark's FilePartitions, then converts them to LocalFilesNode and serializes them to byte arrays in protobuf format.
This roughly doubles driver memory usage, because the FilePartitions are not released after they are converted to LocalFilesNodes.
When there are many file splits (file statuses), the impact is significant.
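To make the difference concrete, here is a minimal sketch of eager versus deferred serialization. It is not the code in this patch: `LazySplitSketch`, `RawPartition`, `serializeAll` and `serializeForTask` are hypothetical names, and a newline-joined UTF-8 string stands in for the protobuf-encoded LocalFilesNode.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names are Gluten or Spark APIs.
class LazySplitSketch {

    /** Stand-in for a Spark FilePartition: the file paths of one split. */
    static final class RawPartition {
        final int index;
        final List<String> paths;

        RawPartition(int index, List<String> paths) {
            this.index = index;
            this.paths = paths;
        }

        /** Builds the serialized form on demand (newline-joined UTF-8 here, protobuf in the real code). */
        byte[] toSplitBytes() {
            return String.join("\n", paths).getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Eager planning: every partition is serialized up front, so paths and bytes are live together. */
    static List<byte[]> serializeAll(List<RawPartition> partitions) {
        List<byte[]> out = new ArrayList<>(partitions.size());
        for (RawPartition p : partitions) {
            out.add(p.toSplitBytes());
        }
        return out;
    }

    /** Deferred planning: only the raw partition is kept; bytes are built when one split is needed. */
    static byte[] serializeForTask(RawPartition partition) {
        return partition.toSplitBytes();
    }

    public static void main(String[] args) {
        List<RawPartition> partitions = Arrays.asList(
            new RawPartition(0, Arrays.asList("hdfs://nn/t/a.parquet", "hdfs://nn/t/b.parquet")),
            new RawPartition(1, Arrays.asList("hdfs://nn/t/c.parquet")));

        // Deferred path: nothing is serialized until a split is actually requested.
        byte[] split = serializeForTask(partitions.get(1));
        System.out.println(new String(split, StandardCharsets.UTF_8));
    }
}
```

With `serializeAll`, the driver holds the path strings and one byte array per partition for the whole planning phase. With `serializeForTask`, each byte array exists only while its split is being handed off, which is the effect this change aims for.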
For example, in one of our cases there are 48 HDFS paths to list, with 7039474 files under them in total. Vanilla Spark can run this with 20G of driver memory, but Gluten fails.
From the GC log, we can see that Gluten has more String and Byte[] objects than vanilla Spark.
Vanilla Spark Full GC objects
Gluten Full GC objects (before this patch)
Gluten Full GC objects (after this patch)
(Fixes: #5320)