Process five symbols per stream per iteration on AArch64. #3299
DivineOb wants to merge 1 commit into facebook:dev
terrelln left a comment
Sorry for the delay!
We have to retain support for Huffman tableLog = 12, which means that we can't blindly decode 5 symbols per loop (5 * 12 = 60 bits, but we are only guaranteed to have 57 bits in our bitstream). The ASM implementation is only called when tableLog == 11, so it is allowed to make that assumption.
The Zstandard format doesn't actually allow tableLog=12, but we had a bug in our dictionary builder in an early version that could potentially emit tableLog=12. So we want to retain that support.
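The bit-budget argument can be checked with a few lines of C. This is only a sketch: the helper names are hypothetical and are not zstd functions, and the 57-bit figure is taken from the comment above as the number of bits guaranteed to be available after a refill.

```c
#include <assert.h>

/* Assumption: after a refill, at least 57 valid bits are available
 * (the figure quoted in the review comment). */
#define GUARANTEED_BITS_AFTER_REFILL 57u

/* Hypothetical helper (not part of zstd): the largest number of symbols
 * that can be decoded between refills when each symbol may consume up
 * to tableLog bits. */
static unsigned max_symbols_per_refill(unsigned tableLog)
{
    assert(tableLog > 0);
    return GUARANTEED_BITS_AFTER_REFILL / tableLog;
}

/* Hypothetical helper: returns 1 if decoding `symbols` symbols per
 * refill cannot overrun the guaranteed bits, 0 otherwise. */
static int unroll_is_safe(unsigned symbols, unsigned tableLog)
{
    return symbols * tableLog <= GUARANTEED_BITS_AFTER_REFILL;
}
```

With these definitions, `max_symbols_per_refill(11)` is 5 and `max_symbols_per_refill(12)` is 4, which is why a five-symbol unroll is only safe on paths that can assume tableLog <= 11.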
Rather than just adding a 5th symbol to the loops, I'd likely re-write the decoding loop to use a similar approach to the assembly, somewhat like #3155.
I am going to close this PR in favor of writing an optimized C version of the Huffman decoder in Issue #3425.
Thanks for the PR!
This is handled in PR #3449. I'd be happy to accept any patches to the fast C decoder that improve aarch64 performance.
Decode five symbols per stream per iteration in X1 Huffman decompression on AArch64 rather than the default four. The x86 assembly version already implements this change. Doing so gives a modest overall decompression speedup on Neoverse N1. Because Huffman decoding accounts for only a small portion of total runtime, this represents a significant speedup to the Huffman decoding functions themselves.
gcc: 11.2.0
clang: 14.0.6-2
Tests: silesia.tar
Platform: Neoverse N1