
[GLUTEN-9163][VL] Separate compression buffer and disk write buffer configuration#9356

Merged
marin-ma merged 3 commits into apache:main from marin-ma:shuffle-compression-config
Apr 23, 2025

Conversation

@marin-ma
Contributor

@marin-ma marin-ma commented Apr 17, 2025

A follow-up to #9278

spark.shuffle.spill.diskWriteBufferSize sets the size of the buffer that stores sorted rows before spilling. The spiller writes the data in this buffer to the output stream.

spark.io.compression.lz4.blockSize and spark.io.compression.zstd.bufferSize set the compression buffer size in the compressed output stream, depending on which compression codec is configured.

In Spark, the memory allocated for these two buffers is counted as overhead memory, so we use arrow::default_memory_pool to allocate them.

Adds spark.gluten.sql.columnar.shuffle.sort.deserializerBufferSize: buffer size in bytes for the sort-based shuffle reader to deserialize raw input into columnar batches.
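For reference, a sketch of how these settings might look together in spark-defaults.conf — the values below are illustrative (roughly the Spark defaults), not recommendations from this PR:

```properties
# Buffer holding sorted rows before the spiller writes them to the output stream
spark.shuffle.spill.diskWriteBufferSize                        1m

# Compression buffer of the compressed output stream; which key applies
# depends on spark.io.compression.codec
spark.io.compression.codec                                     lz4
spark.io.compression.lz4.blockSize                             32k
# spark.io.compression.zstd.bufferSize                         32k

# New in this PR: buffer for the sort-based shuffle reader's deserialization
spark.gluten.sql.columnar.shuffle.sort.deserializerBufferSize  1m
```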

@github-actions github-actions bot added CORE works for Gluten Core VELOX RSS CLICKHOUSE labels Apr 17, 2025
@github-actions

#9163

@github-actions

Run Gluten Clickhouse CI on x86

@marin-ma marin-ma force-pushed the shuffle-compression-config branch from 22ad109 to b514a4d Compare April 17, 2025 16:41

@marin-ma
Contributor Author

@zhouyuan Could you help to review? Thanks!

@github-actions github-actions bot added the DOCS label Apr 22, 2025

@marin-ma marin-ma requested a review from zhouyuan April 23, 2025 17:52
@marin-ma
Contributor Author

@zhouyuan Could you help to review? Thanks!

GlutenShuffleUtils.getSortEvictBufferSize(sparkConf, compressionCodec);
GlutenShuffleUtils.getCompressionBufferSize(sparkConf, compressionCodec);
diskWriteBufferSize =
    (int) (long) sparkConf.get(package$.MODULE$.SHUFFLE_DISK_WRITE_BUFFER_SIZE());
Member


The code is a little difficult to understand. Is it necessary to cast to long and then cast to int?

Contributor Author

@marin-ma marin-ma Apr 23, 2025


In Spark's source code, the configurations are converted this way. Here's an explanation: apache/spark#24187 (comment)

If we don't convert to long first, it will throw an exception like this:
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
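A minimal, self-contained sketch (plain Java, no Spark dependency) of why the double cast is needed: the untyped config getter returns a boxed java.lang.Long, which cannot be cast directly to Integer. The variable names here are illustrative, not taken from the Gluten code.

```java
public class DiskWriteBufferCast {
    public static void main(String[] args) {
        // Simulates the untyped sparkConf.get(...) result: a boxed java.lang.Long.
        Object value = Long.valueOf(1024 * 1024);

        try {
            int broken = (Integer) value; // a boxed Long is not an Integer
            System.out.println(broken);
        } catch (ClassCastException e) {
            System.out.println("caught ClassCastException");
        }

        // Unbox to the primitive long first, then narrow the primitive to int.
        int diskWriteBufferSize = (int) (long) (Long) value;
        System.out.println(diskWriteBufferSize); // prints 1048576
    }
}
```

The outer (int) is a primitive narrowing conversion, which is always legal; only the reference cast from the boxed object to the wrong wrapper class fails at runtime.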

Member


Thanks.

Member

@zhouyuan zhouyuan left a comment


+1

@marin-ma marin-ma merged commit d077f93 into apache:main Apr 23, 2025
49 checks passed
marin-ma added a commit to marin-ma/gluten that referenced this pull request Jul 16, 2025
warrenzhu25 pushed a commit to warrenzhu25/gluten that referenced this pull request Jan 10, 2026