@PeterPtroc

Description

This PR adds RISC-V hardware acceleration to the CRC32C implementation in the Abseil library bundled with MySQL (extra/abseil). Currently, MySQL on RISC-V falls back to the software implementation.

Implementation Details

  • Detection: Added SupportsRiscvCrc32() in cpu_detect.cc using the riscv_hwprobe syscall (Linux) to detect Zbc or Zbkc extensions at runtime.
  • Build: Updated CMakeLists.txt to check for compiler support of -march=rv64gc_zbc or -march=rv64gc_zbkc.
  • Algorithm: Implemented a carry-less multiplication based CRC32C algorithm in crc_riscv.cc, mirroring the logic used in the x86/ARM implementations but adapted for RISC-V instructions (clmul, clmulh).
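
The detection and instruction bullets above can be sketched as follows. This is a minimal illustration, not the code in cpu_detect.cc or crc_riscv.cc: the names HasCarrylessMultiplySketch, ClmulLowModel, ClmulHighModel and CheckClmulModel are hypothetical, and the probe assumes kernel headers recent enough to define RISCV_HWPROBE_EXT_ZBC and RISCV_HWPROBE_EXT_ZBKC (it simply reports false otherwise, including on non-RISC-V hosts). The two model functions show, bit by bit, what the clmul and clmulh instructions compute on RV64; they do not implement the CRC32C folding itself.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__riscv) && defined(__linux__) && defined(__has_include)
#if __has_include(<asm/hwprobe.h>)
#include <asm/hwprobe.h>
#include <sys/syscall.h>
#include <unistd.h>
#define SKETCH_HAVE_HWPROBE 1
#endif
#endif

// Hypothetical helper (not the PR's SupportsRiscvCrc32): true when the kernel
// reports Zbc or Zbkc for all online CPUs, false when it does not or when the
// probe is unavailable on this platform.
inline bool HasCarrylessMultiplySketch() {
#if defined(SKETCH_HAVE_HWPROBE) && defined(__NR_riscv_hwprobe) && \
    defined(RISCV_HWPROBE_EXT_ZBC) && defined(RISCV_HWPROBE_EXT_ZBKC)
  struct riscv_hwprobe pair = {};
  pair.key = RISCV_HWPROBE_KEY_IMA_EXT_0;
  // cpusetsize = 0 and cpus = nullptr ask about all online CPUs.
  if (syscall(__NR_riscv_hwprobe, &pair, 1L, 0L, nullptr, 0L) != 0) return false;
  // The kernel rewrites an unrecognized key to -1.
  if (pair.key != RISCV_HWPROBE_KEY_IMA_EXT_0) return false;
  return (pair.value & (RISCV_HWPROBE_EXT_ZBC | RISCV_HWPROBE_EXT_ZBKC)) != 0;
#else
  return false;
#endif
}

// Software model of `clmul` on RV64: low 64 bits of the carry-less product
// (partial products are combined with XOR instead of addition).
inline std::uint64_t ClmulLowModel(std::uint64_t a, std::uint64_t b) {
  std::uint64_t out = 0;
  for (int i = 0; i < 64; ++i) {
    if ((b >> i) & 1) out ^= a << i;
  }
  return out;
}

// Software model of `clmulh` on RV64: high 64 bits of the same 128-bit product.
inline std::uint64_t ClmulHighModel(std::uint64_t a, std::uint64_t b) {
  std::uint64_t out = 0;
  for (int i = 1; i < 64; ++i) {
    if ((b >> i) & 1) out ^= a >> (64 - i);
  }
  return out;
}

// (x + 1) * (x + 1) = x^2 + 1 over GF(2), so 0b11 clmul 0b11 = 0b101.
inline void CheckClmulModel() {
  assert(ClmulLowModel(3, 3) == 5);
  assert(ClmulHighModel(3, 3) == 0);
}
```

A dispatcher would typically call the probe once, cache the result, and choose between the accelerated and the software path; how the PR wires this into Abseil's CRC selection is not shown here.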

Performance Results

Benchmarks were run on a RISC-V server (SG2044).

Environment:

  • CPU: 64 Cores @ 2.6 GHz
  • OS: openEuler (Linux)

Comparison

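Benchmark           | Baseline Time (ns) | Accelerated Time (ns) | Improvement
--------------------|--------------------|-----------------------|------------
BM_Calculate/500000 | 7,773,083          | 2,994,595             | 2.60x
BM_Extend/500000    | 7,779,846          | 2,736,667             | 2.84x
BM_Memcpy/500000    | 7,867,667          | 2,782,868             | 2.83x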

Throughput for BM_Memcpy (500KB) improved from 60.7 MiB/s to 171.7 MiB/s.
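
For readers who want to re-derive the figures above, here is a small sketch of the arithmetic. The helper names Speedup, MiBPerSecond and CheckReportedNumbers are hypothetical. One assumption worth flagging: the bytes_per_second values in the raw logs appear to be reproduced by the CPU column (7,849,380 ns and 2,776,637 ns for BM_Memcpy/500000) rather than the wall-clock Time column, so the throughput check uses the CPU times.

```cpp
#include <cassert>
#include <cmath>

// Baseline time divided by accelerated time; values above 1 mean faster.
inline double Speedup(double baseline_ns, double accelerated_ns) {
  return baseline_ns / accelerated_ns;
}

// Throughput in MiB/s for `bytes` processed in `ns` nanoseconds.
inline double MiBPerSecond(double bytes, double ns) {
  const double seconds = ns * 1e-9;
  return bytes / seconds / (1024.0 * 1024.0);
}

// Cross-checks the summary table against the raw benchmark numbers.
inline void CheckReportedNumbers() {
  // Improvement column, from the wall-clock Time column.
  assert(std::fabs(Speedup(7773083, 2994595) - 2.60) < 0.01);
  assert(std::fabs(Speedup(7867667, 2782868) - 2.83) < 0.01);
  // BM_Memcpy/500000 throughput, from the CPU column.
  assert(std::fabs(MiBPerSecond(500000, 7849380) - 60.7) < 0.1);
  assert(std::fabs(MiBPerSecond(500000, 2776637) - 171.7) < 0.1);
}
```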

Testing

  • Unit Tests: Ran crc32c_test; all 11 tests in the CRC32C test suite passed.

Impact

This change improves the performance of MySQL on RISC-V platforms, specifically for operations involving CRC32C calculations.

Raw Data

Baseline Benchmark Data
[*]# ./build/crc32c_benchmark
*
Running ./build/crc32c_benchmark
Run on (64 X 2600 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 2048 KiB (x16)
  L3 Unified 65536 KiB (x1)
Load Average: 0.40, 0.92, 4.39
***WARNING*** Library was built as DEBUG. Timings may be affected.
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
BM_Calculate/0                       121 ns          121 ns      5935680
BM_Calculate/1                       128 ns          128 ns      5345757
BM_Calculate/100                    1731 ns         1728 ns       405043
BM_Calculate/10000                154904 ns       154673 ns         4536
BM_Calculate/500000              7773083 ns      7760803 ns           90
BM_Extend/0                          114 ns          114 ns      5994212
BM_Extend/1                          127 ns          127 ns      5591204
BM_Extend/100                       1722 ns         1720 ns       402750
BM_Extend/10000                   155333 ns       154604 ns         4540
BM_Extend/500000                 7779846 ns      7766271 ns           90
BM_Extend/100000000           1562327999 ns   1557962520 ns            1
BM_ExtendCacheMiss/10         3340856196 ns   3330776020 ns            1 bytes_per_second=42.9483Mi/s
BM_ExtendCacheMiss/100        2861855888 ns   2850998120 ns            1 bytes_per_second=50.1758Mi/s
BM_ExtendCacheMiss/1000       2365986213 ns   2357723020 ns            1 bytes_per_second=60.6734Mi/s
BM_ExtendCacheMiss/100000     2346764538 ns   2337288360 ns            1 bytes_per_second=61.2039Mi/s
BM_ExtendByZeroes/1                  133 ns          132 ns      5209676
BM_ExtendByZeroes/10                 132 ns          131 ns      5264641
BM_ExtendByZeroes/100                214 ns          214 ns      3245670
BM_ExtendByZeroes/1000               302 ns          302 ns      2315424
BM_ExtendByZeroes/10000              302 ns          301 ns      2276964
BM_ExtendByZeroes/100000             382 ns          381 ns      1834109
BM_ExtendByZeroes/1000000            378 ns          378 ns      1823858
BM_ExtendByZeroes/1                  134 ns          133 ns      5245663
BM_ExtendByZeroes/32                 139 ns          138 ns      5070037
BM_ExtendByZeroes/1024               141 ns          141 ns      4842622
BM_ExtendByZeroes/32768              146 ns          146 ns      4825369
BM_ExtendByZeroes/1048576            153 ns          153 ns      4555473
BM_UnextendByZeroes/1                222 ns          222 ns      3105543
BM_UnextendByZeroes/10               225 ns          225 ns      3077742
BM_UnextendByZeroes/100              306 ns          306 ns      2287878
BM_UnextendByZeroes/1000             392 ns          390 ns      1791533
BM_UnextendByZeroes/10000            400 ns          399 ns      1759031
BM_UnextendByZeroes/100000           485 ns          484 ns      1471271
BM_UnextendByZeroes/1000000          483 ns          482 ns      1459525
BM_UnextendByZeroes/1                228 ns          227 ns      3102251
BM_UnextendByZeroes/32               231 ns          230 ns      3039191
BM_UnextendByZeroes/1024             236 ns          235 ns      2981607
BM_UnextendByZeroes/32768            244 ns          244 ns      2939284
BM_UnextendByZeroes/1048576          254 ns          254 ns      2748236
BM_Concat/1                          136 ns          135 ns      5231340
BM_Concat/10                         135 ns          135 ns      5297731
BM_Concat/100                        221 ns          220 ns      3191151
BM_Concat/1000                       306 ns          306 ns      2304669
BM_Concat/10000                      310 ns          310 ns      2302788
BM_Concat/100000                     393 ns          392 ns      1768869
BM_Concat/1000000                    399 ns          399 ns      1802905
BM_Concat/1                          137 ns          137 ns      5214617
BM_Concat/32                         142 ns          142 ns      4916018
BM_Concat/1024                       146 ns          145 ns      4965801
BM_Concat/32768                      151 ns          151 ns      4627699
BM_Concat/1048576                    164 ns          164 ns      4327395
BM_Memcpy/0                          218 ns          217 ns      3218486 bytes_per_second=0/s
BM_Memcpy/1                          366 ns          365 ns      1920707 bytes_per_second=2.61004Mi/s
BM_Memcpy/100                       1992 ns         1987 ns       353483 bytes_per_second=47.9894Mi/s
BM_Memcpy/10000                   157493 ns       157222 ns         4456 bytes_per_second=60.6578Mi/s
BM_Memcpy/500000                 7867667 ns      7849380 ns           89 bytes_per_second=60.7484Mi/s
BM_RemoveSuffix/1/1                  240 ns          240 ns      3011473
BM_RemoveSuffix/100/10               240 ns          239 ns      2931486
BM_RemoveSuffix/100/100              321 ns          321 ns      2175971
BM_RemoveSuffix/10000/1              240 ns          240 ns      2908478
BM_RemoveSuffix/10000/100            324 ns          323 ns      2219736
BM_RemoveSuffix/10000/10000          410 ns          409 ns      1706248
BM_RemoveSuffix/500000/1             240 ns          239 ns      3001240
BM_RemoveSuffix/500000/100           322 ns          321 ns      2185117
BM_RemoveSuffix/500000/10000         406 ns          405 ns      1747082
BM_RemoveSuffix/500000/500000        493 ns          492 ns      1386872
Accelerated Benchmark Data
[*]# ./build/crc32c_benchmark
*
Running ./build/crc32c_benchmark
Run on (64 X 2600 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 2048 KiB (x16)
  L3 Unified 65536 KiB (x1)
Load Average: 0.52, 0.49, 0.46
***WARNING*** Library was built as DEBUG. Timings may be affected.
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
BM_Calculate/0                       126 ns          126 ns      5655934
BM_Calculate/1                       128 ns          128 ns      5592517
BM_Calculate/100                     772 ns          771 ns       914064
BM_Calculate/10000                 52917 ns        52776 ns        13513
BM_Calculate/500000              2994595 ns      2987540 ns          237
BM_Extend/0                          129 ns          129 ns      5406487
BM_Extend/1                          133 ns          133 ns      5309586
BM_Extend/100                        791 ns          790 ns       884593
BM_Extend/10000                    51865 ns        51794 ns        13516
BM_Extend/500000                 2736667 ns      2732697 ns          265
BM_Extend/100000000            547786361 ns    545972940 ns            1
BM_ExtendCacheMiss/10         3433398910 ns   3422962480 ns            1 bytes_per_second=41.7916Mi/s
BM_ExtendCacheMiss/100        1320704388 ns   1316158620 ns            1 bytes_per_second=108.688Mi/s
BM_ExtendCacheMiss/1000        828797659 ns    825967740 ns            1 bytes_per_second=173.192Mi/s
BM_ExtendCacheMiss/100000      838523661 ns    836219300 ns            1 bytes_per_second=171.069Mi/s
BM_ExtendByZeroes/1                  135 ns          134 ns      5222792
BM_ExtendByZeroes/10                 136 ns          135 ns      5273478
BM_ExtendByZeroes/100                216 ns          215 ns      3227236
BM_ExtendByZeroes/1000               307 ns          307 ns      2281443
BM_ExtendByZeroes/10000              310 ns          309 ns      2265856
BM_ExtendByZeroes/100000             398 ns          396 ns      1770937
BM_ExtendByZeroes/1000000            401 ns          400 ns      1756397
BM_ExtendByZeroes/1                  134 ns          134 ns      5278843
BM_ExtendByZeroes/32                 136 ns          136 ns      5095584
BM_ExtendByZeroes/1024               141 ns          140 ns      4981809
BM_ExtendByZeroes/32768              147 ns          147 ns      4712063
BM_ExtendByZeroes/1048576            155 ns          155 ns      4589329
BM_UnextendByZeroes/1                211 ns          210 ns      3308688
BM_UnextendByZeroes/10               212 ns          212 ns      3344752
BM_UnextendByZeroes/100              300 ns          299 ns      2368732
BM_UnextendByZeroes/1000             386 ns          385 ns      1844612
BM_UnextendByZeroes/10000            389 ns          389 ns      1812025
BM_UnextendByZeroes/100000           474 ns          472 ns      1477453
BM_UnextendByZeroes/1000000          467 ns          467 ns      1523173
BM_UnextendByZeroes/1                211 ns          210 ns      3310039
BM_UnextendByZeroes/32               217 ns          217 ns      3235875
BM_UnextendByZeroes/1024             219 ns          219 ns      3179369
BM_UnextendByZeroes/32768            229 ns          229 ns      3094327
BM_UnextendByZeroes/1048576          240 ns          240 ns      2925356
BM_Concat/1                          136 ns          136 ns      5274014
BM_Concat/10                         137 ns          137 ns      5165755
BM_Concat/100                        220 ns          220 ns      3216563
BM_Concat/1000                       305 ns          304 ns      2311276
BM_Concat/10000                      315 ns          314 ns      2232102
BM_Concat/100000                     400 ns          399 ns      1775429
BM_Concat/1000000                    395 ns          394 ns      1770390
BM_Concat/1                          135 ns          134 ns      5033814
BM_Concat/32                         139 ns          139 ns      4910071
BM_Concat/1024                       145 ns          145 ns      4813932
BM_Concat/32768                      151 ns          151 ns      4731028
BM_Concat/1048576                    162 ns          162 ns      4271424
BM_Memcpy/0                          218 ns          217 ns      3220307 bytes_per_second=0/s
BM_Memcpy/1                          370 ns          370 ns      1896454 bytes_per_second=2.58062Mi/s
BM_Memcpy/100                       1057 ns         1055 ns       662250 bytes_per_second=90.4238Mi/s
BM_Memcpy/10000                    55050 ns        54950 ns        12693 bytes_per_second=173.554Mi/s
BM_Memcpy/500000                 2782868 ns      2776637 ns          252 bytes_per_second=171.732Mi/s
BM_RemoveSuffix/1/1                  224 ns          224 ns      3120627
BM_RemoveSuffix/100/10               223 ns          223 ns      3143985
BM_RemoveSuffix/100/100              305 ns          305 ns      2285349
BM_RemoveSuffix/10000/1              224 ns          223 ns      3138593
BM_RemoveSuffix/10000/100            306 ns          306 ns      2278230
BM_RemoveSuffix/10000/10000          397 ns          396 ns      1767367
BM_RemoveSuffix/500000/1             224 ns          224 ns      3152227
BM_RemoveSuffix/500000/100           305 ns          304 ns      2253388
BM_RemoveSuffix/500000/10000         388 ns          387 ns      1794191
BM_RemoveSuffix/500000/500000        483 ns          482 ns      1477453
Test Output
[*]# ./build/crc32c_test
Running main() from /home/ptc/mysql/mysql-server/extra/abseil/abseil-cpp-20230802.1/standalone_build/build/_deps/googletest-src/googletest/src/gtest_main.cc
[==========] Running 11 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 11 tests from CRC32C
[ RUN      ] CRC32C.RFC3720
[       OK ] CRC32C.RFC3720 (0 ms)
[ RUN      ] CRC32C.Compute
[       OK ] CRC32C.Compute (0 ms)
[ RUN      ] CRC32C.Extend
[       OK ] CRC32C.Extend (0 ms)
[ RUN      ] CRC32C.ExtendByZeroes
[       OK ] CRC32C.ExtendByZeroes (1 ms)
[ RUN      ] CRC32C.UnextendByZeroes
[       OK ] CRC32C.UnextendByZeroes (0 ms)
[ RUN      ] CRC32C.Concat
[       OK ] CRC32C.Concat (0 ms)
[ RUN      ] CRC32C.Memcpy
[       OK ] CRC32C.Memcpy (2 ms)
[ RUN      ] CRC32C.RemovePrefix
[       OK ] CRC32C.RemovePrefix (0 ms)
[ RUN      ] CRC32C.RemoveSuffix
[       OK ] CRC32C.RemoveSuffix (0 ms)
[ RUN      ] CRC32C.InsertionOperator
[       OK ] CRC32C.InsertionOperator (0 ms)
[ RUN      ] CRC32C.AbslStringify
[       OK ] CRC32C.AbslStringify (0 ms)
[----------] 11 tests from CRC32C (5 ms total)

[----------] Global test environment tear-down
[==========] 11 tests from 1 test suite ran. (5 ms total)
[  PASSED  ] 11 tests.

This patch introduces hardware acceleration for CRC32C on RISC-V
architecture within the bundled Abseil library (`extra/abseil`).

The implementation utilizes the RISC-V Zbc (Carry-less multiplication)
or Zbkc extensions to accelerate CRC calculations. Runtime feature
detection is implemented using `riscv_hwprobe` on Linux systems.

Performance benchmarks on a RISC-V server show significant improvements
(approx. 2.6x - 2.8x speedup for large buffers):

Benchmark (500KB data)  | Baseline (ns) | Accelerated (ns) | Speedup
------------------------|---------------|------------------|--------
BM_Calculate/500000     | 7,773,083     | 2,994,595        | 2.60x
BM_Extend/500000        | 7,779,846     | 2,736,667        | 2.84x
BM_Memcpy/500000        | 7,867,667     | 2,782,868        | 2.83x

Throughput for `BM_Memcpy` increased from ~60.7 MiB/s to ~171.7 MiB/s.

Verified with `crc32c_test`; all 11 tests passed.

Co-authored-by: gong-flying <gongxiaofei24@iscas.ac.cn>
Signed-off-by: PeterPtroc <2402365479@qq.com>
@mysql-oca-bot

Hi, thank you for submitting this pull request. In order to consider your code we need you to sign the Oracle Contribution Agreement (OCA). Please review the details and follow the instructions at https://oca.opensource.oracle.com/
Please make sure to include your MySQL bug system user (email) in the returned form.
Thanks
