[GLUTEN-9163][VL] Use stream de/compressor in sort-based shuffle #9278
marin-ma merged 15 commits into apache:main
Conversation
@j7nhai I can see the compression size is reduced with streaming compression. Could you try this patch for your case?
@j7nhai That's not true. We didn't use a compressed output stream because the spilled data was already compressed before being written to the output stream. This patch modifies the compression logic: it removes that pre-compression and lets the compressed output stream do the work.
It works much better now in my case. In some cases, the number of shuffle bytes written is even less than vanilla Spark.
Hi @j7nhai, thanks for reporting back! Other than the shuffle data size, have you also noticed any performance improvement on the shuffle operator with this patch? Thanks, -yuan
zhouyuan
left a comment
👍
The code is clean. I understand the logic here is to use a compressing stream instead of iostream + separate compression, so the benefit may come from less metadata being written?
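The "less metadata" intuition above can be sketched with Python's zlib as a stand-in for the lz4/zstd codecs actually used (buffer count and contents here are made up for illustration): compressing many small buffers independently pays a fixed header/trailer cost and loses the dictionary for every buffer, while one streaming compressor pays the header once and shares its window across buffers.

```python
import zlib

# Hypothetical payload: many small buffers, as produced by row-based spilling.
buffers = [bytes([i % 256]) * 1000 for i in range(200)]

# Old approach (sketch): compress each buffer independently, then write.
# Every buffer pays the per-stream header/trailer overhead and starts
# with an empty dictionary.
per_buffer = sum(len(zlib.compress(b)) for b in buffers)

# New approach (sketch): feed all buffers through one streaming compressor;
# the window is shared across buffers and the header is paid once.
c = zlib.compressobj()
streamed = sum(len(c.compress(b)) for b in buffers) + len(c.flush())

print(per_buffer, streamed)  # streamed is typically much smaller
```

The exact savings depend on the codec and data, but the shared-window effect is what makes the streaming variant win on many small spill buffers.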
    return getBufferSize(buffers_);
  }

  bool InMemoryPayload::mergable() const {
Nice refactor! Typo: mergeable?
  // TODO: Support setting chunk size
  // Write 64 KB compressed data at a time
  static const int64_t kChunkSize = 64 * 1024;
Compared with the original method, will this bring a higher memory footprint?
We should respect spark.io.compression.lz4.blockSize and spark.io.compression.zstd.bufferSize when creating this buffer. Both have a default value of 32k. I will create a follow-up PR to do some refactoring and make this configurable.
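The chunked write under a configurable block size can be sketched as follows (a Python illustration using zlib; `chunk_size` and `write_compressed` are hypothetical names, with the 32 KiB default borrowed from Spark's lz4 blockSize / zstd bufferSize mentioned above):

```python
import io
import zlib

def write_compressed(out: io.BufferedIOBase, data: bytes, chunk_size: int = 32 * 1024) -> None:
    """Sketch: stream `data` through one compressor, feeding at most
    `chunk_size` input bytes per compress() call, so the in-flight
    buffer stays bounded regardless of the total payload size."""
    c = zlib.compressobj()
    for off in range(0, len(data), chunk_size):
        out.write(c.compress(data[off:off + chunk_size]))
    out.write(c.flush())  # finalize the stream

buf = io.BytesIO()
payload = b"gluten" * 100_000
write_compressed(buf, payload)
assert zlib.decompress(buf.getvalue()) == payload  # round-trips
```

Bounding the per-call input is what keeps the memory footprint flat even when a single partition's spill is large.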
  }

  RETURN_NOT_OK(FinalizeCompression());
  ARROW_ASSIGN_OR_RAISE(compressor_, codec_->MakeCompressor());
Can we move this compressor out of the shuffle stream and reuse it?
We cannot reuse the compressor outside of the stream. One output stream is held by one spiller, and the compressor is re-created each time the spiller receives a new partition to spill. Therefore the compressor can be ended and then recreated multiple times while writing one output stream.
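The layout described above, where one physical stream holds several independently finalized compression frames (one per spilled partition), can be illustrated with zlib streams (a Python sketch; the partition contents are made up). The reader resumes at each frame boundary via `unused_data`, which is also why the compressor cannot simply be hoisted out and reused across the whole stream:

```python
import zlib

# One frame per spilled partition, all written into the same stream.
partitions = [b"part-a" * 500, b"part-b" * 300, b"part-c" * 700]

stream = b""
for p in partitions:
    c = zlib.compressobj()               # compressor re-created per partition
    stream += c.compress(p) + c.flush()  # flush() ends this frame

# Reading back: decompress frame by frame; bytes after each finalized
# frame land in unused_data, where the next decompressor picks up.
out, rest = [], stream
while rest:
    d = zlib.decompressobj()
    out.append(d.decompress(rest))
    rest = d.unused_data

assert out == partitions
```

Ending a frame flushes codec state, so a hoisted, shared compressor would either merge frames (breaking per-partition reads) or still need a reset per partition.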
@GlutenPerfBot benchmark velox
namespace gluten {

namespace {
constexpr uint64_t kMaxReadBufferSize = (1 << 20) - AlignedBuffer::kPaddedSize;
Maybe we can make this configurable? @marin-ma For cases with huge data, this may not be enough.
It seems this patch causes a core dump. Error message: Conf:
@wangyum Thanks for reporting! Is it convenient for you to share the shuffle schema as well? It looks like TPC-H/DS-based tests are not able to trigger the issue in our testing env.
It seems I can reproduce this issue:

create table t_decimal_tt11 as
SELECT Cast(id AS DECIMAL(18, 0)) id
FROM Range(10000000);

create table t_decimal_tt12 as
SELECT Cast(id AS DECIMAL(18, 0)) id
FROM Range(2103627078L);

insert into t_decimal_tt12 values(-998), (-999), (-1), (-10), (-9999), (-1003), (-2), (-1005), (-1006), (-1007), (-1008), (-1009);

SELECT Count(*)
FROM t_decimal_tt11 t_decimal_tt1
LEFT JOIN t_decimal_tt12 t_decimal_tt2
ON t_decimal_tt1.id = t_decimal_tt2.id;
…che#9278) Change-Id: Ie908909977e9a78b3b81ae738193de929fa58dbc
For shuffle writer tasks with heavy spill, streaming compression should provide a better compression ratio. Currently, spilled data is compressed before being written to the output stream. This patch modifies the compression logic for sort-based shuffle: it removes that pre-compression and uses a compressed output stream to do the work.