Improve parquet gzip compression performance using zlib-rs#7200
mbrobbel merged 5 commits into apache:main from
Conversation
  snap = { version = "1.0", default-features = false, optional = true }
  brotli = { version = "7.0", default-features = false, features = ["std"], optional = true }
- flate2 = { version = "1.0", default-features = false, features = ["rust_backend"], optional = true }
+ flate2 = { version = "1.1", default-features = false, features = ["zlib-rs"], optional = true }
Is this pure-rust? Does this compile for wasm32-unknown-unknown?
Yes, it's written in pure Rust.
In my fork I can see the wasm32 pipeline passing: https://github.com/psvri/arrow-rs/actions/runs/13547936797/job/37864105483
|
It seems like this would be another good reason to 🤔 |
|
I merged this PR up from main to rerun the tests, as I think the failing CI check was resolved |
|
Hi @psvri -- thank you for this contribution. I noticed that this PR reports benchmark numbers. Which benchmark were these from? Are they from zlib-rs? Did you run any benchmarks for the parquet crate itself (as in from the https://github.com/apache/arrow-rs/tree/main/parquet/benches directory)? |
|
Hello @alamb, I got these numbers by running the gzip benchmarks from this file: https://github.com/apache/arrow-rs/blob/main/parquet/benches/compression.rs |
alamb
left a comment
Thank you again @psvri
I also ran the benchmarks on this branch and saw similar improvements as you reported (up to 96% faster)
++ critcmp main improve_parquet_gzip_compressions_2
group improve_parquet_gzip_compressions_2 main
----- ----------------------------------- ----
compress GZIP(GzipLevel(6)) - alphanumeric 1.00 27.4±0.06ms ? ?/sec 1.46 40.1±0.21ms ? ?/sec
compress GZIP(GzipLevel(6)) - words 1.00 31.5±0.09ms ? ?/sec 1.55 49.0±0.09ms ? ?/sec
decompress GZIP(GzipLevel(6)) - alphanumeric 1.00 3.0±0.01ms ? ?/sec 1.12 3.4±0.01ms ? ?/sec
decompress GZIP(GzipLevel(6)) - words 1.00 1275.7±5.24µs ? ?/sec 1.96 2.5±0.01ms ? ?/sec
Let's hold this one until the next major parquet release (ETA in April) to minimize the chance of disruption. I think we can merge the PR after I cut the next maintenance release in a few days.
|
The MSRV action is failing since I haven't updated the Rust version from 1.70 to 1.75 in my branch. Would it be possible to update the MSRV in the next major version? |
I think we should. I think we are blocked on someone writing down a policy. |
…com/psvri/arrow-rs into improve_parquet_gzip_compressions_2
|
🎉 -- thank you so much @psvri for sticking with this -- the crate is better all around because of it |
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
Closes #.
Rationale for this change
We will use zlib-rs for deflate operations, which has much better performance than the current backend. I can see a ~10%-47% performance improvement in various scenarios.
perf numbers
What changes are included in this PR?
I have updated the flate2 library to use the zlib-rs backend. This does mean that we need to bump our MSRV to 1.75, so I don't expect the PR to be merged immediately until we sort out #181.
Also, we currently allow gzip level 10 in our parquet implementation, which is a non-compliant gzip level as explained here. Hence I have also changed the max gzip level to 9.
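The level-cap change can be sketched as a small validator. This is a hypothetical illustration (the function name and error type are not from the PR), mirroring the described behavior: gzip defines levels 0 through 9, so the previously accepted level 10 is now rejected:

```rust
// Hypothetical validator mirroring the PR's change: gzip levels
// range over 0..=9, so 10 (the old maximum) is no longer accepted.
fn validate_gzip_level(level: u32) -> Result<u32, String> {
    const MIN_LEVEL: u32 = 0;
    const MAX_LEVEL: u32 = 9; // was 10 before this change

    if (MIN_LEVEL..=MAX_LEVEL).contains(&level) {
        Ok(level)
    } else {
        Err(format!(
            "valid gzip compression levels are {MIN_LEVEL}..={MAX_LEVEL}, got {level}"
        ))
    }
}

fn main() {
    assert!(validate_gzip_level(6).is_ok());
    assert!(validate_gzip_level(9).is_ok());
    // Level 10 used to be accepted but is non-compliant with gzip.
    assert!(validate_gzip_level(10).is_err());
    println!("gzip level validation ok");
}
```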
Are there any user-facing changes?
Yes, max gzip level is now 9 in parquet.