Add fast huf_dec with generic C and tuned aarch64 assembly #3155
JunHe77 wants to merge 2 commits into facebook:dev from
Conversation
This is the C version of the fast decompression algorithm implemented in huf_decompress_amd64.S. Signed-off-by: Jun He <jun.he@arm.com> Change-Id: I964b109f4fd7fc9ca256b280e9add37c84f2e597
This is based on the fast HUF_4x1 decoding first introduced by Nick Terrell. It is manually tuned to balance performance across various Arm micro-architectures, including N1/A72/A57. Signed-off-by: Jun He <jun.he@arm.com> Change-Id: I2de7afd44a4b43cfbedc80747aef4a36c6ae35eb
What's the difference between the C version and the ASM version? It looks like the gains for the aarch64 version are smaller than the gains for the x86-64 version, which could make sense because aarch64 has much lower register pressure, and that was the main constraint on x86-64. Given those smaller gains, I'd be a bit hesitant to merge an ASM implementation, but I'd be more open to a generic C version.
Thanks for the comments, @terrelln. I understand the maintenance effort for an assembly implementation. Below is the comparison between C and ASM for silesia at level 8. Please check.
In 5 of the 12 cases, the ASM version achieves 1%+ better performance (~2% for sao on both N1 and A72).
Is the C version the current zstd code, or the fast decode C you've written? If it is the latter, what is the difference between zstd's code and the tuned C?
The C version is essentially a "rewritten asm in C" of huf_decompress_amd64.S. I started the port on Arm in C, then hand-tuned assembly versions for both 4x1 and 4x2.
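The "rewritten asm in C" structure is easiest to see in toy form. The sketch below is not zstd's actual loop (the real HUF decoders read variable-length Huffman codes backwards from a bit container); it only illustrates the 4-stream interleaving that lets the four independent table lookups overlap in flight on an out-of-order core. All names here are illustrative, not zstd APIs.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy model of the fast HUF_4x1 decode shape: four independent streams,
 * one symbol decoded from each stream per loop iteration. To keep the
 * sketch runnable, each "symbol" is a fixed 8-bit code looked up in a
 * 256-entry table, rather than a variable-length Huffman code. */
typedef struct {
    const uint8_t* ip;   /* next input byte for this stream */
    uint8_t* op;         /* next output byte for this stream */
} ToyStream;

static void toy_decode_4streams(ToyStream s[4], size_t symbolsPerStream,
                                const uint8_t table[256])
{
    for (size_t n = 0; n < symbolsPerStream; n++) {
        /* The four lookups below have no data dependency on each other,
         * so the CPU can overlap their load latencies. */
        for (int i = 0; i < 4; i++) {
            *s[i].op++ = table[*s[i].ip++];
        }
    }
}
```

The real loops additionally refill the bit containers and handle the tail with a safe fallback; this sketch omits all of that to keep the interleaving pattern visible.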
Hi @terrelln, is there anything I need to follow up on for this PR? Thanks.
terrelln left a comment
Thanks for the PR @JunHe77, and sorry for the delay in review!
I'm working on a PR for C decoders, similar to what you added here. I was able to eke out a bit more speed (on x86-64; I will measure aarch64 as well) and clean up the code a little.
I will put my PR up next week, at which point please feel free to benchmark your C / aarch64 ASM implementations against the PR. If there are meaningful gains, we can look into merging.
Understood, @terrelln. I will run the benchmarks once your PR is ready. Thanks.
Hi @JunHe77, I've just merged PR #3449. I've found that on my M1 chip those loops perform at least as well as your C versions, but I would be happy to accept patches to the fast C loops if you can find a faster variant. If the assembly implementation is still significantly faster than the fast C variant, I would be willing to accept the aarch64 assembly implementation. However, it would have to be disabled by default and require the definition of a macro in the build process. The reason it would have to be disabled by default is that we have no way to continuously fuzz the code: all of our fuzzers run on x86-64 and i386, and we want to minimize the amount of code in zstd that isn't fuzzed. I trust that your assembly is correct, but I don't feel comfortable including any non-trivial code in zstd that isn't fuzzed. If we merge it disabled by default, and oss-fuzz later adds support for aarch64, we could start fuzzing that code and switch it to enabled by default.
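The opt-in gating described above could look something like the sketch below. The macro name `ZSTD_ENABLE_ASM_AARCH64` is hypothetical, not an existing zstd macro; the point is only that the assembly body compiles solely when the build explicitly asks for it (e.g. `-DZSTD_ENABLE_ASM_AARCH64`) on an aarch64 target, and everything else falls back to the C loops.

```c
/* Hypothetical opt-in gate for an aarch64 assembly decoder.
 * ZSTD_ENABLE_ASM_AARCH64 is an illustrative name: unless the build
 * system defines it explicitly on an aarch64 target, the assembly is
 * never compiled and the portable C loops are used instead. */
#if defined(ZSTD_ENABLE_ASM_AARCH64) && defined(__aarch64__)
#  define HUF_USE_AARCH64_ASM 1
#else
#  define HUF_USE_AARCH64_ASM 0   /* default: fall back to the C loops */
#endif
```

This mirrors the general pattern zstd already uses for its x86-64 assembly, where preprocessor checks on the target architecture decide whether `huf_decompress_amd64.S` participates in the build.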
This PR includes a generic C implementation of the fast decode loop and a tuned 4x1 assembly version for Arm.
For silesia, we observed a 3.9% improvement for sao and ~2% for mozilla/ooffice/osdb/x-ray.
As the author of the original algorithm, could you please help review this, @terrelln? Thanks a lot.