Skip to content

[CORE] Defer Protobuf serialization of SplitInfos in GlutenPartitions#10662

Merged
marin-ma merged 1 commit intoapache:mainfrom
kevinwilfong:serialize_split
Sep 12, 2025
Merged

[CORE] Defer Protobuf serialization of SplitInfos in GlutenPartitions#10662
marin-ma merged 1 commit intoapache:mainfrom
kevinwilfong:serialize_split

Conversation

@kevinwilfong
Copy link
Copy Markdown
Collaborator

@kevinwilfong kevinwilfong commented Sep 9, 2025

What changes are proposed in this pull request?

Today the GlutenPartition objects contain an array of byte arrays which are the Protobuf serialized ReadRel.read_type objects from the SplitInfos. The GlutenPartitions are Java serialized and sent to the Executors responsible for their respective Tasks. Looking through the code it appears we Protobuf serialize the SplitInfos so we can easily pass them across the JNI boundary.

We see the serialized SplitInfos can consume a significant amount of memory in the Driver, this is because as SplitInfo objects their state can share references tot he same objects, but once serialized they share nothing, which explodes their size in memory.

If we Java serialize the SplitInfo objects like the rest of the GlutenPartition state, and Protobuf serialize them as part of the Task, this can significantly save driver memory. The cost is a little additional memory in the Task, the size of the SplitInfo objects for a single GlutenPartition which should be trivial, and a little additional CPU instead of Protobuf serializing in the Driver and Java serializing the array of byte arrays, we Java serialize the array of SplitInfos, and on the Task we pay the additional cost of Java deserializing an array of SplitInfos and Protobuf serializing them, overall the difference is just the additional cost of Java serializing the SplitInfos instead of byte arrays.

How was this patch tested?

Ran the existing unit tests.

Verified locally that a query with particularly high Driver memory due to serialized SplitInfos saw a significant reduction.

@github-actions github-actions bot added CORE works for Gluten Core VELOX CLICKHOUSE labels Sep 9, 2025
@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 9, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@kevinwilfong kevinwilfong marked this pull request as ready for review September 10, 2025 23:36
@kevinwilfong
Copy link
Copy Markdown
Collaborator Author

kevinwilfong commented Sep 10, 2025

It looks like tests are failing due to #10671 (not related to this change)

Copy link
Copy Markdown
Contributor

@marin-ma marin-ma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!
Just one question regarding the solution of this performance issue: Does it mean the driver memory can be decreased with this patch because java serialisation only serialise the same object only once?

@kevinwilfong
Copy link
Copy Markdown
Collaborator Author

kevinwilfong commented Sep 12, 2025

@marin-ma Thanks for the review!

Does it mean the driver memory can be decreased with this patch because java serialisation only serialise the same object only once?

I suspect the reason is that Spark Java serializes the GlutenPartitions as needed and does not hold the serialized values in memory for a long time. In Gluten, we're currently Protobuf serializing the SplitInfos when we create the GlutenPartitions, and I see a large number of these GlutenPartitions getting held in the Driver's memory while the query is running, so the serialized SplitInfos all exist together at the same time. If Spark is Java serializing the GlutenPartitions only when a Task is ready to execute, and evicts the serialized value from memory as soon as it's been sent to the Executor, with this change we'll only end up with a relatively small number of serialized values present in the Driver's memory at the same time (proportional to the number of Executors).

@marin-ma marin-ma merged commit 0d170be into apache:main Sep 12, 2025
52 of 56 checks passed
kevinwilfong added a commit to kevinwilfong/incubator-gluten that referenced this pull request Sep 12, 2025
@Yohahaha
Copy link
Copy Markdown
Contributor

#6572

We have another pr to decrease driver memory pressure, just post here to see if we can apply.

wForget pushed a commit to wForget/gluten that referenced this pull request Sep 23, 2025
@wForget wForget mentioned this pull request Sep 23, 2025
wForget pushed a commit to wForget/gluten that referenced this pull request Oct 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLICKHOUSE CORE works for Gluten Core VELOX

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants