RISC-V RVV: Boost Chameleon to 494 MB/s, Fix Tests WIP #96
Dayuxiaoshui wants to merge 9 commits into g1mv:dev
Conversation
- Implemented `encode_batch` in `Chameleon` (QuadEncoder) using RISC-V Vector (RVV) intrinsics to vectorize hash computation, dictionary gather, and comparison.
- Updated `encode_block` in the `Codec` trait to use `encode_batch` for aligned u32 quads, with a scalar fallback for the prefix/suffix.
- Added conditional compilation for RVV (`#[cfg(all(target_arch = "riscv64", target_feature = "v"))]`) with a scalar fallback for non-RVV environments.

Co-authored-by: gong-flying <gongxiaofei24@iscas.ac.cn>
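For context, a minimal sketch of the cfg-gated dispatch pattern the description refers to; the free-function signature is illustrative, not the PR's actual method:

```rust
// Sketch: compile the RVV path only on riscv64 with the "v" extension,
// and fall back to scalar code everywhere else.
#[cfg(all(target_arch = "riscv64", target_feature = "v"))]
fn encode_batch(_quads: &[u32]) {
    // RVV path: vectorized hash computation, dictionary gather, and compare.
}

#[cfg(not(all(target_arch = "riscv64", target_feature = "v")))]
fn encode_batch(_quads: &[u32]) {
    // Scalar fallback: encode each quad sequentially.
}
```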
Hello! I had a look at your code; all in all it looks good 👍
```rust
for &quad in quads {
    self.encode_quad(quad, out_buffer, signature);
}
return;
```
In my tests during development, I found that splitting the block into 128-bit sub-blocks and then extracting 4 quads using shift-masking was faster than iterating over each quad as you do here - see the initial codec.rs file for that:
```rust
for sub_block in block.chunks(BYTE_SIZE_U128) {
    match <&[u8] as TryInto<[u8; BYTE_SIZE_U128]>>::try_into(sub_block) {
```
I suspect this is because the Rust compiler can optimize that version more easily than sequential `self.encode_quad()` calls.
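For illustration, a minimal sketch of that extraction, assuming little-endian byte order (the real codec.rs may handle endianness differently):

```rust
// Extract 4 u32 quads from one 128-bit sub-block via shifts.
fn extract_quads(sub_block: [u8; 16]) -> [u32; 4] {
    let value = u128::from_le_bytes(sub_block);
    [
        value as u32,
        (value >> 32) as u32,
        (value >> 64) as u32,
        (value >> 96) as u32,
    ]
}
```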
Current situation: we have performed the isolation work, but some test cases still fail to pass. This may be due to compatibility issues caused by specific environments or configurations. Based on the current situation and community suggestions, we plan to create a new branch to isolate these issues and perform targeted optimizations. This will help keep the main branch stable while allowing testing and iteration in specific environments. If you are using a RISC-V environment and want to switch to the branch optimized for RISC-V, run `git checkout rvv`.
```rust
self.encode_batch(u32_block, out_buffer, signature);
```
Upon further review, I think there is a problem here that would explain all the CI errors.

Concept-wise, the method `encode_quad()` does not check signature status at all; it assumes the calling code handles all signature integrity functions.

In your PR's code, in Chameleon, when you call `encode_batch(u32_block, ...)`, you must make sure that a new signature data structure is created every 64 quads (or 256 bytes); every 64 quads because this algorithm stores the underlying signature in a u64. As `u32_block` can have any given length, there is a signature management problem here. The same applies to prefix and suffix encoding.

To summarize, all functions calling `encode_quad()` must check whether the current signature is complete (all 64 bits used, in the Chameleon case) and, if so, create a new one. This is implicitly done in the current main branch with the following code:
```rust
for block in input.chunks(Self::block_size()) {
    self.encode_block(block, &mut out_buffer, &mut signature, &mut protection_state);
}
```
By chunking into `Self::block_size()` blocks (256 bytes for Chameleon), signature integrity is handled appropriately, since `encode_block()` starts by creating a new signature:
```rust
signature.init(out_buffer.index);
out_buffer.skip(Self::signature_significant_bytes());
```
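To make the rollover concrete, a minimal sketch with stand-in `OutBuffer`/`Signature` types (density-rs's real structures differ), assuming the batch begins exactly on a signature boundary:

```rust
// Stand-in types for illustration only.
struct OutBuffer { index: usize }
impl OutBuffer {
    fn skip(&mut self, n: usize) { self.index += n; }
}
struct Signature { bits: u64, start: usize }
impl Signature {
    fn init(&mut self, index: usize) { self.bits = 0; self.start = index; }
}

const QUADS_PER_SIGNATURE: usize = 64; // a u64 holds 64 one-bit Chameleon flags
const SIGNATURE_BYTES: usize = 8;      // space reserved per signature (assumed)

fn encode_quad(_quad: u32, _out: &mut OutBuffer, _signature: &mut Signature) {
    // Per-quad hash/lookup/emit, as discussed above (omitted).
}

fn encode_batch(quads: &[u32], out: &mut OutBuffer, signature: &mut Signature) {
    for (index, &quad) in quads.iter().enumerate() {
        // Start a fresh signature whenever the current u64 is fully consumed.
        if index % QUADS_PER_SIGNATURE == 0 {
            signature.init(out.index);
            out.skip(SIGNATURE_BYTES);
        }
        encode_quad(quad, out, signature);
    }
}
```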
Issue Report: Algorithm Compatibility Issue in Non-RVV Environments & Request for Community Help
1. Core Issue Status
The current code has an environment-specific failure:
- ✅ RVV Environment: All algorithms (chameleon/cheetah/lion) run normally;
- ❌ Non-RVV Environment: Only `chameleon` (with `flag_size_bits = 1`) works correctly. Tests for `cheetah` (with `flag_size_bits = 2`) and `lion` (with `flag_size_bits = 3`) fail, manifesting as garbled output bytes (redundant PLAIN data on the left, low signature values).
2. Identified Root Cause
The core problem lies in u64 signature bit overflow and lack of environment adaptation logic:
- Signature Storage Limit Constraint: The `signature` is implemented on top of a u64 (supporting a maximum of 64 bits), while the total number of bits from `push_bits` equals `flag_size_bits` × the number of quads in a block. For cheetah and lion, if there are too many quads in a single block, the accumulated bits far exceed 64, directly causing signature overflow/incompleteness and breaking the decoding logic (a quick arithmetic check follows this list).
- Differences in Environment Adaptation:
  - The normal operation in the RVV environment is a "coincidental adaptation": the block size currently used for RVV batch processing happens to avoid the bit-overflow scenario when `flag_size_bits > 1`; this does not mean the overflow issue is logically resolved.
  - The failure in the non-RVV environment is an "inevitable result": after the optimization, the non-RVV path still uses a fixed `block_size` (the original 256 bytes) and does not dynamically reduce the block size based on `flag_size_bits`. This makes the overflow explicit when `flag_size_bits > 1` (the original code implicitly avoided this via `chunks(block_size)`, but the optimized batch logic breaks that constraint).
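A quick arithmetic check of that overflow, using the 256-byte block and 4-byte quads described above:

```rust
// Total signature bits one block needs: flag_size_bits * (block_bytes / 4).
fn flag_bits_per_block(flag_size_bits: u32, block_bytes: u32) -> u32 {
    flag_size_bits * (block_bytes / 4)
}

// With the fixed 256-byte block (64 quads):
//   chameleon: flag_bits_per_block(1, 256) ==  64 -> exactly fills a u64
//   cheetah:   flag_bits_per_block(2, 256) == 128 -> overflows the u64
//   lion:      flag_bits_per_block(3, 256) == 192 -> overflows the u64
```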
3. Attempted Solutions & Limitations
To address this issue, I previously tried the approach of "dynamically adjusting block size", as detailed below:
- Added two trait functions: `flag_size_bits()` (exposes the algorithm's flag bit count) and `decode_twin_flag_mask_bits()` (provides the bit count of the dual-flag mask);
- Adjusted the `block_size` calculation logic: derived the block size from `DECODE_TWIN_FLAG_MASK_BITS` so that the total `push_bits` for a single block stays ≤ 64 bits and the block size is an integer multiple of `decode_unit_size` (to avoid non-integer iteration counts); a sketch of this derivation follows below.
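A sketch of that derivation as I understand it; 64 stands in for the u64 signature capacity, and the `decode_unit_size` of 16 bytes used in the examples is an assumption:

```rust
// Derive a block size whose total push_bits fit in one u64 signature,
// rounded down to a multiple of decode_unit_size.
fn dynamic_block_size(flag_size_bits: usize, decode_unit_size: usize) -> usize {
    let max_quads = 64 / flag_size_bits; // quads one u64 signature can flag
    let raw_bytes = max_quads * 4;       // 4 bytes per quad
    (raw_bytes / decode_unit_size) * decode_unit_size
}

// dynamic_block_size(1, 16) == 256 (chameleon: matches the original block size)
// dynamic_block_size(2, 16) == 128 (cheetah)
// dynamic_block_size(3, 16) == 80  (lion: 21 quads = 84 bytes, rounded down)
```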
However, this solution failed to resolve the anomaly in non-RVV environments, and after further investigation the key bottleneck remains unidentified. I suspect a hidden conflict between the non-RVV batch-processing logic and the dynamically adjusted block size, but due to my limited understanding of the overall code flow I cannot currently locate the specific problem.
4. Request for Community Assistance
Since I am currently stuck at the critical node of "dynamic block size adaptation in non-RVV environments", I hope to leverage the community’s experience and perspectives to overcome this challenge together:
- If any community members have encountered similar issues related to "u64 signature overflow + environment-specific adaptation", could you share your troubleshooting ideas or solutions?
- Regarding "how to ensure compatibility between the adjusted
block_sizeand the logic ofencode_batch/encode_quadduring batch processing in non-RVV environments", are there any better design approaches? - Currently, it is unclear whether there are other implicit constraints in the batch logic of non-RVV environments that affect the effectiveness of block size adjustments. If any community members are familiar with this part of the code, could you help sort out the logic flow together?
Going forward, I will continue to organize the call stack and log details of the non-RVV environment to supplement more debugging information. Thank you all in the community for your help—I look forward to resolving this issue together and enabling the vector compression feature to cover more environment scenarios! 💪
Hi @Dayuxiaoshui! (Benchmark screenshots: density v0.16.6 on main vs. your PR.) I suspect this has to do with initially-optimized code that has been modified (see #96 (comment)).
PR: RISC-V RVV Optimization for density-rs - Initial Results and Request for Review
Hi @g1mv and community members, thanks for the previous feedback! 🙌
I've completed the initial optimization work on `density-rs`, focusing on RISC-V with RVV vector extensions. Below, I'll first summarize what I've done, then explain the current issues, and invite everyone to review the results and provide feedback! 😄

What I've Done
Based on your suggestions and discussions, I prioritized optimizing the Chameleon algorithm's core loops (e.g., `encode_quad` and `encode_batch`), and extended similar improvements to Cheetah and Lion. The optimizations include:

Manual RVV Vectorization:
- In `encode_batch`, used RVV intrinsics (e.g., `vle32_v_u32m1`, `vmul_vx_u32m1`, `vsrl_vx_u32m1`, `vluxei32_v_u32m1`, and `vmseq_vv_m_b32`) for hash calculations, dictionary accesses, and conflict detection.
- Guarded the code with `#[cfg(all(target_arch = "riscv64", target_feature = "v"))]` for compatibility, and handled VLEN variability with `vsetvli`. (A scalar per-lane sketch of the batch follows below.)
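To show what each vector lane computes, here is a scalar equivalent of the hash/gather/compare batch; the multiplier and hash width are illustrative constants, not density-rs's actual values:

```rust
// Scalar per-lane equivalent of the RVV batch described above.
const HASH_MULTIPLIER: u32 = 0x9E37_79B1; // illustrative constant
const HASH_BITS: u32 = 16;                // illustrative hash width

fn lane(quad: u32, dictionary: &[u32]) -> bool {
    // vmul_vx + vsrl_vx: multiplicative hash of the loaded quad (vle32)
    let hash = quad.wrapping_mul(HASH_MULTIPLIER) >> (32 - HASH_BITS);
    // vluxei32: indexed gather from the dictionary
    let dict_entry = dictionary[hash as usize];
    // vmseq_vv: per-lane equality mask (hit => MAP flag, miss => PLAIN)
    dict_entry == quad
}
```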
Algorithm Improvements:
- Benchmarked on `dickens.txt` (10.19 MB), comparing before-and-after performance (default vs optimized).

Performance Comparison:
Using median throughput (MB/s), compression ratios unchanged (benchmark table not shown):

Key Achievements: Chameleon compression is nearing the 500 MB/s goal! 🎯 Overall compression speeds improved significantly, while decompression varied slightly (minor drops in Cheetah and Lion).
Code Cleanup:
- `BYTE_SIZE_U128` and `std::arch::riscv64::*` remain (to be fixed in the PR).

These changes build on your feedback (e.g., dynamic vectorization ideas and architectural preferences) and ensure cross-platform adaptability.
Current Issues
Last night, I ran benchmarks (throughput rates), and the performance data looks good as shown above. However, when running unit tests (`cargo test`), all three tests (`tests::chameleon`, `tests::cheetah`, `tests::lion`) failed because the decoded output did not match the input (assertion `left == right` failed). For example, in the Chameleon test the byte arrays differ significantly, likely a bug I introduced in the RVV vectorization (e.g., improper handling of dictionary update sequencing on hash conflicts; see the sketch after this section).

Unfortunately, my lab's SG2044 server (RISC-V environment) is under maintenance, so I can't debug or test further today (e.g., with `RUST_BACKTRACE=1` or on real hardware). This delays final PR polishing, but I plan to fix it tomorrow or once the server is back.
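For reference, a sketch of the suspected sequencing hazard: scalar encoding interleaves dictionary updates with lookups, so a later quad can hit an entry written by an earlier quad in the same batch, while a gather-then-compare vector batch reads all entries before any update (illustrative code, not the PR's):

```rust
// Scalar order: each lookup sees every earlier update.
fn scalar_order(quads: &[u32], dict: &mut [u32], hash: impl Fn(u32) -> usize) -> Vec<bool> {
    quads.iter().map(|&quad| {
        let slot = hash(quad);
        let hit = dict[slot] == quad; // may hit an entry written just above
        dict[slot] = quad;            // visible to the next quad
        hit
    }).collect()
}

// Batch order (as vluxei32/vmseq do): gather + compare first, update after.
// Two quads sharing a slot in one batch diverge from scalar_order.
fn batch_order(quads: &[u32], dict: &mut [u32], hash: impl Fn(u32) -> usize) -> Vec<bool> {
    let hits: Vec<bool> = quads.iter().map(|&q| dict[hash(q)] == q).collect();
    for &quad in quads {
        dict[hash(quad)] = quad;
    }
    hits
}
```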
Request for Review

I've pushed the code to the dev branch (link: https://github.com/g1mv/density/tree/dev). Please take a look at the results and provide a review! 🔍 Especially the RVV vectorization approach (e.g., `encode_batch`)?

The PR will include optimized code, benchmark data, and docs. Once the server is up, I'll fix the tests and update. Thanks for your support! 😄 Let's perfect this RISC-V compression powerhouse together! 🚀
Have a great day,