@rbalamohan (Contributor)

PR to use snappy as default compression for parquet instead of gzip. #5658

@sumeetgajjar (Contributor)

org.apache.iceberg.TestSplitScan > test[format = parquet] FAILED

Hi @rbalamohan, the above test is failing because of the change in compression codec: the data file is now ~77 MB, and with a 16 MB split size it is planned as 5 scan tasks.


Reducing the number of records to 2000000, which produces roughly the same file size as gzip did, resolves the test failure.
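
For context, here is a minimal sketch of the arithmetic behind the failure, assuming the test controls the split size through Iceberg's `read.split.target-size` table property (`TableProperties.SPLIT_SIZE`); the class and method names below are hypothetical:

```java
import org.apache.iceberg.CombinedScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.io.CloseableIterable;

public class SplitPlanningSketch {

  // Set a 16 MB target split size, then count the scan tasks Iceberg plans.
  // A single ~77 MB parquet file is planned as ceil(77 / 16) = 5 tasks;
  // the smaller gzip-compressed file produced fewer.
  static void printScanTaskCount(Table table) throws Exception {
    table.updateProperties()
        .set(TableProperties.SPLIT_SIZE, String.valueOf(16 * 1024 * 1024))
        .commit();

    int count = 0;
    try (CloseableIterable<CombinedScanTask> tasks = table.newScan().planTasks()) {
      for (CombinedScanTask ignored : tasks) {
        count++;
      }
    }
    System.out.println("planned scan tasks: " + count);
  }
}
```

The point is that the task count is a function of file size over split size, so any codec change that grows the file shifts the expected count.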

github-actions bot added the data label on Aug 30, 2022
@rdblue (Contributor) commented on Aug 31, 2022

From the fairly broad testing that I've done, snappy is never a good choice for compression, whichever dimension you optimize for. Snappy is often fast, but gets very poor compression rates. LZ4 is a much better choice if you're optimizing for write speed, because it is usually both faster and smaller than snappy. But if you're optimizing for compression ratio, I probably wouldn't choose LZ4 either.
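
For reference, the default under discussion is the per-table `write.parquet.compression-codec` property (`TableProperties.PARQUET_COMPRESSION`). A minimal sketch of overriding it on an existing table, assuming a loaded `Table` handle; the class and method names are hypothetical:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;

public class CompressionCodecSketch {

  // Change the parquet codec used for newly written data files.
  // Files already in the table keep whatever codec they were written with.
  static void setCodec(Table table, String codec) {
    table.updateProperties()
        .set(TableProperties.PARQUET_COMPRESSION, codec) // "write.parquet.compression-codec"
        .commit();
  }
}
```

For example, `setCodec(table, "snappy")` applies what this PR proposes as the default, while `setCodec(table, "gzip")` restores the previous behavior, so the trade-off rdblue describes can be made per table without changing the global default.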
