
RISC-V RVV: Boost Chameleon to 494 MB/s, Fix Tests WIP #96

Closed
Dayuxiaoshui wants to merge 9 commits into g1mv:dev from Dayuxiaoshui:main

Conversation

@Dayuxiaoshui

PR: RISC-V RVV Optimization for density-rs - Initial Results and Request for Review

Hi @g1mv and community members, thanks for the previous feedback! 🙌

I've completed the initial optimization work on density-rs, focusing on RISC-V with RVV vector extensions. Below, I'll first summarize what I've done, then explain the current issues, and invite everyone to review the results and provide feedback! 😄

What I've Done

Based on your suggestions and discussions, I prioritized optimizing the Chameleon algorithm's core loops (e.g., encode_quad and encode_batch), and extended similar improvements to Cheetah and Lion. The optimizations include:

  1. Manual RVV Vectorization:

    • In encode_batch, used RVV intrinsics (e.g., vle32_v_u32m1, vmul_vx_u32m1, vsrl_vx_u32m1, vluxei32_v_u32m1, and vmseq_vv_m_b32) for hash calculation, dictionary gathers, and conflict detection (a scalar sketch of this per-quad logic follows at the end of this section).
    • Handled hash uniqueness and sequencing: fall back to scalar paths on conflicts to ensure correct dictionary updates (referencing your "case a and b" analysis).
    • Used conditional compilation #[cfg(all(target_arch = "riscv64", target_feature = "v"))] for compatibility, and handled VLEN variability with vsetvli.
  2. Algorithm Improvements:

    • Reduced branch overhead and memory accesses (e.g., optimized hash multiplication and shifts).
    • Attempted dynamic mode switching (enabling non-updating batches when the dictionary update rate drops below 0.1), though this is still preliminary.
    • Benchmarked with dickens.txt (10.19 MB), comparing before and after performance (default vs optimized).
  3. Performance Comparison:
    Using median throughput (MB/s), compression ratios unchanged:

    Algorithm   Operation             Before (MB/s)   After (MB/s)   Change   Ratio
    Chameleon   Compress (raw)        380.2           494.0          +30%     1.749x
    Chameleon   Decompress (raw)      494.4           503.1          +2%
    Cheetah     Compress (raw)        220.8           264.5          +20%     1.860x
    Cheetah     Decompress (raw)      291.4           287.2          -1%
    Lion        Compress (raw)        135.3           150.7          +11%     1.966x
    Lion        Decompress (raw)      144.9           143.5          -1%
    LZ4         Compress (raw)        82.15           79.26          -3%      1.585x
    LZ4         Decompress (raw)      174.2           190.5          +9%
    Snappy      Compress (stream)     83.69           83.46          -0.3%    1.607x
    Snappy      Decompress (stream)   141.0           141.7          +0.5%

    Key achievement: Chameleon compression is nearing the 500 MB/s goal! 🎯 Compression speeds improved significantly across the board, while decompression varied slightly (minor drops for Cheetah and Lion).

  4. Code Cleanup:

    • Stuck to stable Rust, no external crates.
    • Added runtime fallbacks for non-RVV hardware.
    • Partially fixed warnings, but unused BYTE_SIZE_U128 and std::arch::riscv64::* remain (to be fixed in PR).

These changes build on your feedback (e.g., dynamic vectorization ideas and architectural preferences) and ensure cross-platform adaptability.
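
For reviewers who prefer the gist without the intrinsics, here is a scalar sketch of the per-quad logic that the RVV path vectorizes. This is illustrative only: the hash width, multiplier, and dictionary layout below are placeholders, not the exact density-rs internals.

const HASH_BITS: u32 = 16;                // illustrative hash width
const HASH_MULTIPLIER: u32 = 0x9E37_79B1; // placeholder multiplicative constant

fn hash(quad: u32) -> usize {
    (quad.wrapping_mul(HASH_MULTIPLIER) >> (32 - HASH_BITS)) as usize
}

// Scalar equivalent of one encode_batch lane: hash the quad (vmul + vsrl),
// gather the dictionary entry (vluxei32), compare it (vmseq), then either
// record a dictionary hit or update the entry so the quad is emitted as plain data.
fn encode_quad_scalar(quad: u32, dictionary: &mut [u32]) -> bool {
    let h = hash(quad);
    let hit = dictionary[h] == quad;
    if !hit {
        dictionary[h] = quad; // miss: update the dictionary entry
    }
    hit // the caller records this hit/miss flag in the signature
}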

Current Issues

Last night I ran the throughput benchmarks, and the performance data looks good, as shown above. However, when running unit tests (cargo test), all three tests (tests::chameleon, tests::cheetah, tests::lion) failed because the decoded output did not match the input (assertion left == right failed). For example, in the Chameleon test the byte arrays differ significantly, likely due to a bug I introduced in the RVV vectorization (e.g., improper handling of dictionary update sequencing on hash conflicts).

Unfortunately, my lab's SG2044 server (RISC-V environment) is under maintenance, so I can't debug or test further today (e.g., with RUST_BACKTRACE=1 or on real hardware). This delays final PR polishing, but I plan to fix it tomorrow or once the server is back.

Request for Review

I've pushed the code to the dev branch (https://github.com/g1mv/density/tree/dev). Please take a look at the results and review! 🔍 In particular:

  • Does the performance data make sense? Any suggestions for more test files (e.g., enwik8)?
  • Ideas for optimizing or fixing bugs in RVV implementation (encode_batch)?
  • Analysis of decompression performance drops?
  • Other feedback, like cross-platform testing or documentation improvements.

The PR will include optimized code, benchmark data, and docs. Once the server is up, I'll fix the tests and update. Thanks for your support! 😄 Let's perfect this RISC-V compression powerhouse together! 🚀

Have a great day,

Dayuxiaoshui and others added 6 commits August 20, 2025 15:06
- Implemented `encode_batch` in `Chameleon` (QuadEncoder) using RISC-V Vector (RVV) intrinsics to vectorize hash computation, dictionary gather, and comparison.
- Updated `encode_block` in `Codec` trait to use `encode_batch` for aligned u32 quads, with scalar fallback for prefix/suffix.
- Added conditional compilation for RVV (`#[cfg(all(target_arch = "riscv64", target_feature = "v"))]`) with a scalar fallback for non-RVV environments.

Co-authored-by: gong-flying <gongxiaofei24@iscas.ac.cn>
@g1mv
Owner

g1mv commented Aug 21, 2025

Hello!

I had a look at your code; all in all it looks good 👍
There are a few missing conditional imports (one in lib.rs, if I'm not mistaken).
I saw that none of the CI tests pass as of now, but let's start by making the nightly build work, and then we'll focus on the beta/stable tests.
I'll review your unsafe {} blocks.
Thank you!

for &quad in quads {
    self.encode_quad(quad, out_buffer, signature);
}
return;
Owner

In my tests during development, I found that splitting the block into 128-bit sub-blocks and then extracting 4 quads using shift-masking was faster than iterating over each quad as you do here - see the initial codec.rs file for that:

for sub_block in block.chunks(BYTE_SIZE_U128) {
    match <&[u8] as TryInto<[u8; BYTE_SIZE_U128]>>::try_into(sub_block) {

I suspect this is because the Rust compiler could easily optimize the code (sequential self.encode_quad() calls).
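
For illustration, a minimal sketch of that shift-masking extraction (assuming little-endian quad order; the per-quad encoding body is elided, and this is not the actual codec.rs code):

// Split the block into 128-bit sub-blocks and extract 4 u32 quads from a
// single u128 by shift-masking.
for sub_block in block.chunks_exact(16) {
    let bytes: [u8; 16] = sub_block.try_into().unwrap(); // length is always 16 here
    let v = u128::from_le_bytes(bytes);
    let quads = [
        v as u32,         // quad 0: low 32 bits
        (v >> 32) as u32, // quad 1
        (v >> 64) as u32, // quad 2
        (v >> 96) as u32, // quad 3
    ];
    // ... encode each quad ...
}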

Comment thread src/algorithms/chameleon/chameleon.rs
@Dayuxiaoshui
Author

Current situation: we have isolated the failing paths, but some test cases still fail. This may be due to compatibility issues tied to specific environments or configurations.

Based on the current situation and community suggestions, we plan to create a new branch to isolate these issues and perform targeted optimizations. This will help maintain the stability of the main branch while allowing testing and iteration in specific environments.

If you are using a RISC-V environment and want to switch to the branch optimized for RISC-V, please execute the following command:

git checkout rvv

Comment thread src/codec/codec.rs Outdated
}
}

self.encode_batch(u32_block, out_buffer, signature);
Owner

Upon further review, I think that there is a problem here that would explain all the CI errors.

Concept-wise, the method encode_quad() does not check at all for signature status, it assumes the calling code is handling all signature integrity functions.

In your PR's code, in Chameleon, when you call encode_batch(u32_block...), you must make sure that every 64 quads (or 256 bytes) a new signature data structure is created appropriately (every 64 quads because for this algo the underlying signature is stored in a u64). As u32_block can have any given length, there is a signature management problem here. The same applies for prefix and suffix encoding.

To summarize, all functions calling encode_quad() must assert whether the current signature is complete (all 64 bits have been used in the Chameleon case), and if that is the case, create a new one. This is implicitly done in the current main branch with the following code:

for block in input.chunks(Self::block_size()) {
    self.encode_block(block, &mut out_buffer, &mut signature, &mut protection_state);
}

By chunking into Self::block_size() blocks (256 bytes for Chameleon), signature integrity is handled appropriately as encode_block() starts by creating a new signature:

signature.init(out_buffer.index);
out_buffer.skip(Self::signature_significant_bytes());
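
One way to restore that invariant inside encode_batch, sketched with the names above (a hypothetical restructuring, not the actual PR code; the write-back of each completed signature is elided):

// Process the quad slice in 64-quad groups so that a fresh signature is
// created whenever the current u64 signature is full (Chameleon case).
const QUADS_PER_SIGNATURE: usize = 64; // 1 flag bit per quad in a u64

for group in u32_block.chunks(QUADS_PER_SIGNATURE) {
    signature.init(out_buffer.index);
    out_buffer.skip(Self::signature_significant_bytes());
    for &quad in group {
        self.encode_quad(quad, out_buffer, signature);
    }
}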

Author

Issue Report: Algorithm Compatibility Issue in Non-RVV Environments & Request for Community Help

1. Core Issue Status

The current code has an environment-specific failure:

  • RVV Environment: All algorithms (chameleon/cheetah/lion) run normally;
  • Non-RVV Environment: Only chameleon (flag_size_bits = 1) works correctly. Tests for cheetah (flag_size_bits = 2) and lion (flag_size_bits = 3) fail, producing garbled output bytes (redundant PLAIN data on the left-hand side of the assertion and abnormally low signature values).

2. Identified Root Cause

The core problem lies in u64 signature bit overflow and lack of environment adaptation logic:

  1. Signature Storage Limit Constraint: The signature is implemented as a u64 (at most 64 bits), while the total number of bits pushed via push_bits equals flag_size_bits × the number of quads in a block. For cheetah and lion, a block with too many quads accumulates far more than 64 bits, directly causing signature overflow/truncation and breaking the decoding logic (see the arithmetic below).
  2. Differences in Environment Adaptation:
    • The RVV environment works by coincidence, not by design: the block size used for RVV batch processing happens to avoid the bit-overflow scenario when flag_size_bits > 1; the overflow issue is not logically resolved.
    • The non-RVV failure is the inevitable result: after the optimization, the non-RVV path still uses a fixed block_size (the original 256 bytes) and does not shrink the block based on flag_size_bits, so the overflow becomes visible whenever flag_size_bits > 1. The original code implicitly avoided this via chunks(block_size), but the optimized batch logic breaks that constraint.
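
To make the overflow concrete, here is the arithmetic under the premise above, i.e. that the batch path keeps the original 256-byte block for every algorithm:

// 256-byte block = 64 u32 quads; the signature holds at most 64 bits.
let quads_per_block = 256 / 4;
assert_eq!(1 * quads_per_block, 64);  // chameleon: exactly fills the u64 signature
assert_eq!(2 * quads_per_block, 128); // cheetah: double the capacity -> overflow
assert_eq!(3 * quads_per_block, 192); // lion: triple the capacity -> overflow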

3. Attempted Solutions & Limitations

To address this issue, I previously tried the approach of "dynamically adjusting block size", as detailed below:

  • Added 2 trait functions: flag_size_bits() (exposes the algorithm's flag bit count) and decode_twin_flag_mask_bits() (provides the bit count of the dual-flag mask);
  • Adjusted the block_size calculation logic: derived the block size from DECODE_TWIN_FLAG_MASK_BITS so that the total push_bits for a single block stays ≤ 64 bits, and so that the block size is an integer multiple of decode_unit_size (avoiding non-integer iteration counts); see the sketch below.
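
A minimal sketch of that derivation (as a hypothetical free function; the actual code uses the trait methods named above):

// Derive a block size such that one block never pushes more than 64 flag
// bits into the u64 signature, rounded down to a whole decode unit.
fn derived_block_size(flag_size_bits: usize, decode_unit_size: usize) -> usize {
    let max_quads = 64 / flag_size_bits;             // quads whose flags fit in a u64
    let raw_size = max_quads * 4;                    // 4 bytes per u32 quad
    (raw_size / decode_unit_size) * decode_unit_size // align to the decode unit
}

// chameleon: flag_size_bits = 1 -> 64 quads -> 256 bytes (the original block size)
// cheetah:   flag_size_bits = 2 -> 32 quads -> 128 bytes
// lion:      flag_size_bits = 3 -> 21 quads -> 84 bytes, then decode-unit aligned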

However, this did not resolve the anomaly in non-RVV environments, and after further investigation the key bottleneck remains unidentified. I suspect a hidden conflict between the non-RVV batch-processing logic and the dynamically adjusted block size, but given my limited understanding of the overall code flow, I cannot currently pinpoint the problem.

4. Request for Community Assistance

Since I am currently stuck on "dynamic block size adaptation in non-RVV environments", I hope to draw on the community's experience and perspectives to overcome this challenge together:

  1. If any community members have encountered similar issues related to "u64 signature overflow + environment-specific adaptation", could you share your troubleshooting ideas or solutions?
  2. Regarding "how to ensure compatibility between the adjusted block_size and the logic of encode_batch/encode_quad during batch processing in non-RVV environments", are there any better design approaches?
  3. Currently, it is unclear whether there are other implicit constraints in the batch logic of non-RVV environments that affect the effectiveness of block size adjustments. If any community members are familiar with this part of the code, could you help sort out the logic flow together?

Going forward, I will continue to organize the call stack and log details of the non-RVV environment to supplement more debugging information. Thank you all in the community for your help—I look forward to resolving this issue together and enabling the vector compression feature to cover more environment scenarios! 💪

Dayuxiaoshui and others added 2 commits August 29, 2025 18:37
@g1mv
Owner

g1mv commented Aug 29, 2025

Hi @Dayuxiaoshui !
Thanks for your work!
All CI checks are now passing, which is a good thing!! However, on non-RVV platforms things are now noticeably slower on the compression side.

density v0.16.6 main:

Running benches/density.rs (target/release/deps/density-ffb3911c7d1e6161)
Using file ./benches/data/dickens.txt (10192446 bytes)
Timer precision: 41 ns
density                            fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ chameleon                                     │               │               │               │         │
│  ├─ compress/raw      (1.749x)   4.621 ms      │ 5.143 ms      │ 4.714 ms      │ 4.752 ms      │ 25      │ 25
│  │                               2.205 GB/s    │ 1.981 GB/s    │ 2.161 GB/s    │ 2.144 GB/s    │         │
│  ╰─ decompress/raw               3.465 ms      │ 3.637 ms      │ 3.499 ms      │ 3.505 ms      │ 25      │ 25
│                                  2.941 GB/s    │ 2.802 GB/s    │ 2.912 GB/s    │ 2.907 GB/s    │         │
├─ cheetah                                       │               │               │               │         │
│  ├─ compress/raw      (1.860x)   8.508 ms      │ 8.764 ms      │ 8.668 ms      │ 8.653 ms      │ 25      │ 25
│  │                               1.197 GB/s    │ 1.162 GB/s    │ 1.175 GB/s    │ 1.177 GB/s    │         │
│  ╰─ decompress/raw               5.711 ms      │ 6.133 ms      │ 5.757 ms      │ 5.78 ms       │ 25      │ 25
│                                  1.784 GB/s    │ 1.661 GB/s    │ 1.77 GB/s     │ 1.763 GB/s    │         │
╰─ lion                                          │               │               │               │         │
   ├─ compress/raw      (1.966x)   14.26 ms      │ 14.71 ms      │ 14.49 ms      │ 14.48 ms      │ 25      │ 25
   │                               714.4 MB/s    │ 692.6 MB/s    │ 702.9 MB/s    │ 703.4 MB/s    │         │
   ╰─ decompress/raw               9.846 ms      │ 10.09 ms      │ 9.9 ms        │ 9.907 ms      │ 25      │ 25
                                   1.035 GB/s    │ 1.009 GB/s    │ 1.029 GB/s    │ 1.028 GB/s    │         │

     Running benches/lz4.rs (target/release/deps/lz4-81d65a4341189a2c)
Using file ./benches/data/dickens.txt (10192446 bytes)
Timer precision: 41 ns
lz4                                fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ default                                       │               │               │               │         │
   ├─ compress/raw      (1.585x)   21.58 ms      │ 22.14 ms      │ 21.77 ms      │ 21.77 ms      │ 25      │ 25
   │                               472.2 MB/s    │ 460.3 MB/s    │ 468 MB/s      │ 468 MB/s      │         │
   ╰─ decompress/raw               3.408 ms      │ 3.573 ms      │ 3.467 ms      │ 3.473 ms      │ 25      │ 25
                                   2.99 GB/s     │ 2.851 GB/s    │ 2.939 GB/s    │ 2.934 GB/s    │         │

     Running benches/snappy.rs (target/release/deps/snappy-24f528086ee7362b)
Using file ./benches/data/dickens.txt (10192446 bytes)
Timer precision: 41 ns
snappy                             fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ default                                       │               │               │               │         │
   ├─ compress/stream   (1.607x)   30.13 ms      │ 30.62 ms      │ 30.24 ms      │ 30.27 ms      │ 25      │ 25
   │                               338.2 MB/s    │ 332.8 MB/s    │ 336.9 MB/s    │ 336.7 MB/s    │         │
   ╰─ decompress/stream            13.06 ms      │ 13.49 ms      │ 13.14 ms      │ 13.17 ms      │ 25      │ 25
                                   779.9 MB/s    │ 755.2 MB/s    │ 775.5 MB/s    │ 773.6 MB/s    │         │

     Running benches/utils.rs (target/release/deps/utils-1bbbc75f001bbb6c)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

your PR:

Running benches/density.rs (target/release/deps/density-dd321f5217221347)
Using file ./benches/data/dickens.txt (10192446 bytes)
Timer precision: 41 ns
density                            fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ chameleon                                     │               │               │               │         │
│  ├─ compress/raw      (1.749x)   11.52 ms      │ 11.9 ms       │ 11.64 ms      │ 11.64 ms      │ 25      │ 25
│  │                               884.4 MB/s    │ 856.1 MB/s    │ 875.4 MB/s    │ 875.1 MB/s    │         │
│  ╰─ decompress/raw               3.397 ms      │ 3.582 ms      │ 3.454 ms      │ 3.459 ms      │ 25      │ 25
│                                  2.999 GB/s    │ 2.844 GB/s    │ 2.95 GB/s     │ 2.946 GB/s    │         │
├─ cheetah                                       │               │               │               │         │
│  ├─ compress/raw      (1.860x)   17.7 ms       │ 18.22 ms      │ 17.88 ms      │ 17.92 ms      │ 25      │ 25
│  │                               575.7 MB/s    │ 559.1 MB/s    │ 570 MB/s      │ 568.5 MB/s    │         │
│  ╰─ decompress/raw               5.801 ms      │ 6.096 ms      │ 5.874 ms      │ 5.886 ms      │ 25      │ 25
│                                  1.757 GB/s    │ 1.671 GB/s    │ 1.735 GB/s    │ 1.731 GB/s    │         │
╰─ lion                                          │               │               │               │         │
   ├─ compress/raw      (1.966x)   27.24 ms      │ 27.48 ms      │ 27.43 ms      │ 27.41 ms      │ 25      │ 25
   │                               374.1 MB/s    │ 370.8 MB/s    │ 371.5 MB/s    │ 371.8 MB/s    │         │
   ╰─ decompress/raw               9.842 ms      │ 10.07 ms      │ 10 ms         │ 9.998 ms      │ 25      │ 25
                                   1.035 GB/s    │ 1.012 GB/s    │ 1.018 GB/s    │ 1.019 GB/s    │         │

     Running benches/lz4.rs (target/release/deps/lz4-ac441e73188f3775)
Using file ./benches/data/dickens.txt (10192446 bytes)
Timer precision: 41 ns
lz4                                fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ default                                       │               │               │               │         │
   ├─ compress/raw      (1.585x)   21.39 ms      │ 21.97 ms      │ 21.74 ms      │ 21.7 ms       │ 25      │ 25
   │                               476.4 MB/s    │ 463.9 MB/s    │ 468.6 MB/s    │ 469.6 MB/s    │         │
   ╰─ decompress/raw               3.406 ms      │ 3.796 ms      │ 3.527 ms      │ 3.515 ms      │ 25      │ 25
                                   2.992 GB/s    │ 2.684 GB/s    │ 2.889 GB/s    │ 2.899 GB/s    │         │

     Running benches/snappy.rs (target/release/deps/snappy-cbc81d637effbcad)
Using file ./benches/data/dickens.txt (10192446 bytes)
Timer precision: 41 ns
snappy                             fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ default                                       │               │               │               │         │
   ├─ compress/stream   (1.607x)   30.16 ms      │ 30.77 ms      │ 30.29 ms      │ 30.31 ms      │ 25      │ 25
   │                               337.8 MB/s    │ 331.1 MB/s    │ 336.3 MB/s    │ 336.2 MB/s    │         │
   ╰─ decompress/stream            13.11 ms      │ 13.49 ms      │ 13.24 ms      │ 13.25 ms      │ 25      │ 25
                                   776.9 MB/s    │ 755.1 MB/s    │ 769.4 MB/s    │ 769.1 MB/s    │         │

     Running benches/utils.rs (target/release/deps/utils-85a1b593e42ef6e8)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

I suspect this has to do with initially-optimized code that has been modified (see #96 (comment)).
