@wesm wesm commented Jun 5, 2020

BinaryBitBlockCounter computes the popcount of the bitwise AND of each corresponding bit run (targeting 64 bits at a time) of two bitmaps. This permits iterating through two validity bitmaps that don't have a lot of nulls much more quickly than using two BitmapReaders.
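
To make the idea concrete, here is a minimal standalone sketch of counting the bits that are set in both bitmaps, one 64-bit word at a time. It is not the code from this PR: `AndBlock`, `PopCount64`, `GetBit` and `CountAndBlocks` are hypothetical names, both bitmaps are assumed to start at bit offset 0, and a portable popcount loop stands in for a hardware instruction.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical result type for this sketch.
struct AndBlock {
  int64_t length;    // number of bits covered by the block
  int64_t popcount;  // bits set in BOTH bitmaps within the block
};

// Portable popcount: clears the lowest set bit until none remain.
inline int64_t PopCount64(uint64_t word) {
  int64_t count = 0;
  while (word != 0) {
    word &= word - 1;
    ++count;
  }
  return count;
}

// Least-significant-bit-first numbering within each byte.
inline bool GetBit(const uint8_t* bits, int64_t i) {
  return ((bits[i / 8] >> (i % 8)) & 1) != 0;
}

// Assumes both bitmaps start at bit offset 0 and hold at least length_bits bits.
std::vector<AndBlock> CountAndBlocks(const uint8_t* left, const uint8_t* right,
                                     int64_t length_bits) {
  assert(length_bits >= 0);
  std::vector<AndBlock> blocks;
  int64_t pos = 0;
  // Full 64-bit words: AND the two words and popcount the result.
  for (; length_bits - pos >= 64; pos += 64) {
    uint64_t l = 0;
    uint64_t r = 0;
    std::memcpy(&l, left + pos / 8, sizeof(l));
    std::memcpy(&r, right + pos / 8, sizeof(r));
    blocks.push_back({64, PopCount64(l & r)});
  }
  // Trailing partial block: fall back to bit-by-bit inspection.
  if (pos < length_bits) {
    int64_t count = 0;
    for (int64_t i = pos; i < length_bits; ++i) {
      if (GetBit(left, i) && GetBit(right, i)) ++count;
    }
    blocks.push_back({length_bits - pos, count});
  }
  return blocks;
}
```

A block whose `popcount` equals its `length` has no nulls on either side, so a caller can process those values without checking individual bits.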

I also added an inlineable BitBlockCounter::NextWordInline that processes 64 bits at a time for a single bitmap. It seems like this may be preferable to the four-word version. Now we have NextWord and NextFourWords, so the developer can choose either variant.
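
As a rough illustration of the two block sizes for a single bitmap, the sketch below exposes a one-word (64-bit) and a four-word (256-bit) step. The class `WordBlockCounter` and its behavior of returning an empty block when too few bits remain are assumptions made for this example, not the interface added in this PR; the bitmap is assumed to start at bit offset 0.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical result type for this sketch.
struct Block {
  int64_t length;    // bits covered by the block (0 when nothing was consumed)
  int64_t popcount;  // set bits within the block
};

// Portable popcount: clears the lowest set bit until none remain.
inline int64_t PopCount64(uint64_t word) {
  int64_t count = 0;
  while (word != 0) {
    word &= word - 1;
    ++count;
  }
  return count;
}

// Hypothetical single-bitmap counter; assumes the bitmap starts at bit offset 0.
class WordBlockCounter {
 public:
  WordBlockCounter(const uint8_t* bitmap, int64_t length_bits)
      : bitmap_(bitmap), remaining_(length_bits) {
    assert(length_bits >= 0);
  }

  // Consume 64 bits, or return {0, 0} if fewer than 64 remain.
  Block NextWord() { return NextWords(1); }

  // Consume 256 bits, or return {0, 0} if fewer than 256 remain.
  Block NextFourWords() { return NextWords(4); }

  int64_t remaining() const { return remaining_; }

 private:
  Block NextWords(int num_words) {
    const int64_t num_bits = 64 * num_words;
    if (remaining_ < num_bits) return {0, 0};
    int64_t count = 0;
    for (int i = 0; i < num_words; ++i) {
      uint64_t word = 0;
      std::memcpy(&word, bitmap_ + 8 * i, sizeof(word));
      count += PopCount64(word);
    }
    bitmap_ += 8 * num_words;
    remaining_ -= num_bits;
    return {num_bits, count};
  }

  const uint8_t* bitmap_;
  int64_t remaining_;
};
```

The trade-off in this sketch is that a larger block amortizes loop overhead over more values, but a single null anywhere in the 256 bits prevents the caller from treating the whole block as all-valid.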

Benchmarks and tests covering all this are included. I'll post the benchmarks on my machine as a comment.

wesm commented Jun 5, 2020

There are some issues with the benchmarks. I'll fix them and repost the numbers.

github-actions bot commented Jun 5, 2020

wesm commented Jun 5, 2020

OK, here are the fixed benchmarks. These match my intuition now. They show that there is very little downside and nearly always upside to using BitBlockCounter over BitmapReader (at least when 1% or fewer of the values are null; at higher null percentages you don't lose much performance by using it). There isn't a benchmark showing the naive double-BitmapReader for the binary case, but that could be added too.

-----------------------------------------------------------------------------------
Benchmark                                            Time           CPU Iterations
-----------------------------------------------------------------------------------
BitBlockCounterSum/8                           1732025 ns    1732026 ns        400   577.358M items/s
BitBlockCounterSum/64                           832029 ns     832025 ns        836   1.17372G items/s
BitBlockCounterSum/512                          265104 ns     265105 ns       2666   3.68368G items/s
BitBlockCounterSum/4096                         151056 ns     151056 ns       4647   6.46488G items/s
BitBlockCounterSum/32768                        138635 ns     138632 ns       5081    7.0443G items/s
BitBlockCounterSum/65536                        137021 ns     137019 ns       4977    7.1272G items/s
BitBlockCounterSumWithOffset/8                 1779549 ns    1779525 ns        395   561.948M items/s
BitBlockCounterSumWithOffset/64                 855158 ns     855149 ns        823   1.14198G items/s
BitBlockCounterSumWithOffset/512                273883 ns     273874 ns       2578   3.56574G items/s
BitBlockCounterSumWithOffset/4096               154422 ns     154422 ns       4499   6.32397G items/s
BitBlockCounterSumWithOffset/32768              141266 ns     141265 ns       4887   6.91299G items/s
BitBlockCounterSumWithOffset/65536              140049 ns     140048 ns       5024   6.97305G items/s
BitBlockCounterInlineSum/8                     1714554 ns    1714564 ns        410   583.239M items/s
BitBlockCounterInlineSum/64                     802222 ns     802227 ns        878   1.21731G items/s
BitBlockCounterInlineSum/512                    239832 ns     239831 ns       2897   4.07188G items/s
BitBlockCounterInlineSum/4096                   129035 ns     129031 ns       5384   7.56842G items/s
BitBlockCounterInlineSum/32768                  116983 ns     116981 ns       6003   8.34801G items/s
BitBlockCounterInlineSum/65536                  116078 ns     116079 ns       6085    8.4129G items/s
BitBlockCounterInlineSumWithOffset/8           1682920 ns    1682911 ns        414   594.209M items/s
BitBlockCounterInlineSumWithOffset/64           809214 ns     809211 ns        874   1.20681G items/s
BitBlockCounterInlineSumWithOffset/512          251953 ns     251951 ns       2756     3.876G items/s
BitBlockCounterInlineSumWithOffset/4096         139827 ns     139826 ns       4988   6.98414G items/s
BitBlockCounterInlineSumWithOffset/32768        127648 ns     127648 ns       5498   7.65041G items/s
BitBlockCounterInlineSumWithOffset/65536        126759 ns     126756 ns       5551   7.70425G items/s
BitBlockCounterFourWordsSum/8                  1655902 ns    1655861 ns        423   603.916M items/s
BitBlockCounterFourWordsSum/64                 1000517 ns    1000507 ns        692   999.493M items/s
BitBlockCounterFourWordsSum/512                 441463 ns     441466 ns       1595   2.21209G items/s
BitBlockCounterFourWordsSum/4096                128194 ns     128193 ns       5484   7.61788G items/s
BitBlockCounterFourWordsSum/32768                85335 ns      85334 ns       8050    11.444G items/s
BitBlockCounterFourWordsSum/65536                82101 ns      82101 ns       8498   11.8947G items/s
BitBlockCounterFourWordsSumWithOffset/8        1647208 ns    1647201 ns        422   607.091M items/s
BitBlockCounterFourWordsSumWithOffset/64       1025215 ns    1025183 ns        700   975.436M items/s
BitBlockCounterFourWordsSumWithOffset/512       462082 ns     462074 ns       1572   2.11343G items/s
BitBlockCounterFourWordsSumWithOffset/4096      132541 ns     132540 ns       5257   7.36808G items/s
BitBlockCounterFourWordsSumWithOffset/32768      92098 ns      92098 ns       7651   10.6035G items/s
BitBlockCounterFourWordsSumWithOffset/65536      87406 ns      87406 ns       7908   11.1727G items/s
BitmapReaderSum/8                              1600625 ns    1600619 ns        442   624.758M items/s
BitmapReaderSum/64                              885446 ns     885445 ns        789   1.10291G items/s
BitmapReaderSum/512                             805230 ns     805219 ns        862   1.21279G items/s
BitmapReaderSum/4096                            794678 ns     794676 ns        870   1.22888G items/s
BitmapReaderSum/32768                           793758 ns     793749 ns        869   1.23032G items/s
BitmapReaderSum/65536                           794828 ns     794812 ns        879   1.22867G items/s
BitmapReaderSumWithOffset/8                    1667559 ns    1667514 ns        419   599.695M items/s
BitmapReaderSumWithOffset/64                    930337 ns     930335 ns        755   1074.88M items/s
BitmapReaderSumWithOffset/512                   841240 ns     841236 ns        841   1.16087G items/s
BitmapReaderSumWithOffset/4096                  840091 ns     840087 ns        853   1.16245G items/s
BitmapReaderSumWithOffset/32768                 828098 ns     828103 ns        846   1.17928G items/s
BitmapReaderSumWithOffset/65536                 831186 ns     831191 ns        854    1.1749G items/s
BinaryBitBlockCounterSum/8                     2974962 ns    2974893 ns        235   336.146M items/s
BinaryBitBlockCounterSum/64                    1697417 ns    1697403 ns        414   589.135M items/s
BinaryBitBlockCounterSum/512                    622981 ns     622973 ns       1165   1.56758G items/s
BinaryBitBlockCounterSum/4096                   251586 ns     251582 ns       2831   3.88168G items/s
BinaryBitBlockCounterSum/32768                  202682 ns     202683 ns       3345   4.81817G items/s
BinaryBitBlockCounterSum/65536                  192151 ns     192150 ns       3653    5.0823G items/s
BinaryBitBlockCounterSumWithOffset/8           3178632 ns    3178625 ns        224   314.601M items/s
BinaryBitBlockCounterSumWithOffset/64          1713947 ns    1713944 ns        404   583.449M items/s
BinaryBitBlockCounterSumWithOffset/512          605481 ns     605476 ns       1158   1.61288G items/s
BinaryBitBlockCounterSumWithOffset/4096         258490 ns     258489 ns       2716   3.77796G items/s
BinaryBitBlockCounterSumWithOffset/32768        212582 ns     212577 ns       3273   4.59393G items/s
BinaryBitBlockCounterSumWithOffset/65536        208857 ns     208857 ns       3357   4.67575G items/s

wesm commented Jun 5, 2020

I'm adding a BitmapReader-based comparison for the binary case. Stay tuned.

wesm commented Jun 5, 2020

OK, here are the binary benchmarks:

--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BinaryBitBlockCounterSum/8                  3189138 ns    3189079 ns        216    313.57M items/s
BinaryBitBlockCounterSum/64                 1839419 ns    1839359 ns        390   543.668M items/s
BinaryBitBlockCounterSum/512                 630842 ns     630808 ns       1121   1.54811G items/s
BinaryBitBlockCounterSum/4096                256330 ns     256332 ns       2746   3.80976G items/s
BinaryBitBlockCounterSum/32768               204388 ns     204383 ns       3454   4.77809G items/s
BinaryBitBlockCounterSum/65536               201268 ns     201260 ns       3428   4.85225G items/s
BinaryBitBlockCounterSumWithOffset/8        3313859 ns    3313805 ns        206   301.768M items/s
BinaryBitBlockCounterSumWithOffset/64       1966957 ns    1966805 ns        360   508.439M items/s
BinaryBitBlockCounterSumWithOffset/512       672431 ns     672434 ns       1088   1.45228G items/s
BinaryBitBlockCounterSumWithOffset/4096      286651 ns     286643 ns       2469   3.40689G items/s
BinaryBitBlockCounterSumWithOffset/32768     228652 ns     228648 ns       3048   4.27103G items/s
BinaryBitBlockCounterSumWithOffset/65536     228191 ns     228188 ns       3171   4.27964G items/s
BinaryBitmapReaderSum/8                     3803716 ns    3803704 ns        183   262.902M items/s
BinaryBitmapReaderSum/64                    2184717 ns    2184728 ns        316   457.723M items/s
BinaryBitmapReaderSum/512                   2018442 ns    2018421 ns        344   495.437M items/s
BinaryBitmapReaderSum/4096                  1997782 ns    1997729 ns        349   500.568M items/s
BinaryBitmapReaderSum/32768                 2024333 ns    2024318 ns        367   493.994M items/s
BinaryBitmapReaderSum/65536                 2018332 ns    2018340 ns        346   495.457M items/s
BinaryBitmapReaderSumWithOffset/8           3926170 ns    3926185 ns        181     254.7M items/s
BinaryBitmapReaderSumWithOffset/64          2198425 ns    2198417 ns        323   454.873M items/s
BinaryBitmapReaderSumWithOffset/512         2001917 ns    2001864 ns        352   499.535M items/s
BinaryBitmapReaderSumWithOffset/4096        1980845 ns    1980853 ns        351   504.833M items/s
BinaryBitmapReaderSumWithOffset/32768       1979394 ns    1979403 ns        365   505.203M items/s
BinaryBitmapReaderSumWithOffset/65536       2029335 ns    2029347 ns        345   492.769M items/s

It seems that it is never a good idea to use BitmapReader for the binary case: even when the incidence of nulls is high, naively using BitUtil::GetBit is better.
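
For readers less familiar with the two access styles being compared, here is a rough sketch of each applied to two bitmaps. `BitCursor` is a hypothetical, simplified stand-in for a BitmapReader-style sequential cursor and `GetBit` is a plain random-access bit test; neither is the Arrow implementation, and the function names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Random-access bit test, least-significant bit first within each byte.
inline bool GetBit(const uint8_t* bits, int64_t i) {
  return ((bits[i / 8] >> (i % 8)) & 1) != 0;
}

// Hypothetical, simplified stand-in for a BitmapReader-style cursor:
// it tracks a byte and bit position and advances one bit at a time.
class BitCursor {
 public:
  BitCursor(const uint8_t* bitmap, int64_t start_bit)
      : bitmap_(bitmap), byte_offset_(start_bit / 8), bit_offset_(start_bit % 8) {
    assert(start_bit >= 0);
  }

  bool IsSet() const {
    return ((bitmap_[byte_offset_] >> bit_offset_) & 1) != 0;
  }

  void Next() {
    if (++bit_offset_ == 8) {
      bit_offset_ = 0;
      ++byte_offset_;
    }
  }

 private:
  const uint8_t* bitmap_;
  int64_t byte_offset_;
  int64_t bit_offset_;
};

// Two cursors advanced in lockstep (the "double reader" style).
int64_t CountBothSetWithCursors(const uint8_t* left, const uint8_t* right,
                                int64_t length) {
  BitCursor l(left, 0);
  BitCursor r(right, 0);
  int64_t count = 0;
  for (int64_t i = 0; i < length; ++i) {
    if (l.IsSet() && r.IsSet()) ++count;
    l.Next();
    r.Next();
  }
  return count;
}

// Plain indexed bit tests on both bitmaps.
int64_t CountBothSetWithGetBit(const uint8_t* left, const uint8_t* right,
                               int64_t length) {
  int64_t count = 0;
  for (int64_t i = 0; i < length; ++i) {
    if (GetBit(left, i) && GetBit(right, i)) ++count;
  }
  return count;
}
```

Both functions compute the same count; the cursor version carries two pieces of per-bitmap state and an extra branch in `Next()` on every step, which is one plausible place for overhead to come from.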

wesm commented Jun 5, 2020

I'm going to be basing some patches on top of this. I can rebase whenever this gets reviewed/merged.

pitrou commented Jun 8, 2020

So the "/8" benchmarks mean that only 12.5% of the values are nulls, right?

pitrou commented Jun 8, 2020

I figured the case with 50% nulls would be the least favorable to this new infrastructure, so I changed the benchmark parameters a bit and got this:

BitBlockCounterSum/2                          3954282 ns      3953801 ns          176 items_per_second=265.207M/s
BitBlockCounterSum/8                          1625945 ns      1625729 ns          432 items_per_second=644.988M/s
BitBlockCounterSum/64                          810721 ns       810606 ns          882 items_per_second=1.29357G/s
BitBlockCounterSum/512                         317532 ns       317493 ns         2206 items_per_second=3.30267G/s
BitBlockCounterSum/1024                        265549 ns       265517 ns         2625 items_per_second=3.94918G/s
BitBlockCounterSumWithOffset/2                3966957 ns      3966456 ns          176 items_per_second=264.361M/s
BitBlockCounterSumWithOffset/8                1620578 ns      1620385 ns          433 items_per_second=647.116M/s
BitBlockCounterSumWithOffset/64                825890 ns       825784 ns          845 items_per_second=1.2698G/s
BitBlockCounterSumWithOffset/512               409940 ns       409894 ns         1711 items_per_second=2.55817G/s
BitBlockCounterSumWithOffset/1024              364805 ns       364756 ns         1923 items_per_second=2.87474G/s

BitBlockCounterInlineSum/2                    3922166 ns      3921700 ns          178 items_per_second=267.378M/s
BitBlockCounterInlineSum/8                    1628186 ns      1627982 ns          431 items_per_second=644.095M/s
BitBlockCounterInlineSum/64                    791733 ns       791639 ns          887 items_per_second=1.32456G/s
BitBlockCounterInlineSum/512                   315124 ns       315084 ns         2221 items_per_second=3.32792G/s
BitBlockCounterInlineSum/1024                  263135 ns       263103 ns         2663 items_per_second=3.98542G/s
BitBlockCounterInlineSumWithOffset/2          3988119 ns      3987657 ns          176 items_per_second=262.955M/s
BitBlockCounterInlineSumWithOffset/8          1606539 ns      1606341 ns          436 items_per_second=652.773M/s
BitBlockCounterInlineSumWithOffset/64          804304 ns       804209 ns          872 items_per_second=1.30386G/s
BitBlockCounterInlineSumWithOffset/512         398009 ns       397964 ns         1758 items_per_second=2.63485G/s
BitBlockCounterInlineSumWithOffset/1024        355099 ns       355057 ns         1973 items_per_second=2.95326G/s

BitBlockCounterFourWordsSum/2                 3902537 ns      3902025 ns          179 items_per_second=268.726M/s
BitBlockCounterFourWordsSum/8                 1554537 ns      1554353 ns          451 items_per_second=674.606M/s
BitBlockCounterFourWordsSum/64                 928975 ns       928856 ns          754 items_per_second=1.12889G/s
BitBlockCounterFourWordsSum/512                480744 ns       480683 ns         1457 items_per_second=2.18143G/s
BitBlockCounterFourWordsSum/1024               357280 ns       357238 ns         1966 items_per_second=2.93523G/s
BitBlockCounterFourWordsSumWithOffset/2       3819196 ns      3818711 ns          183 items_per_second=274.589M/s
BitBlockCounterFourWordsSumWithOffset/8       1612833 ns      1612643 ns          442 items_per_second=650.222M/s
BitBlockCounterFourWordsSumWithOffset/64       950047 ns       949935 ns          752 items_per_second=1.10384G/s
BitBlockCounterFourWordsSumWithOffset/512      501003 ns       500938 ns         1393 items_per_second=2.09323G/s
BitBlockCounterFourWordsSumWithOffset/1024     382677 ns       382633 ns         1830 items_per_second=2.74043G/s

BitmapReaderSum/2                             3696719 ns      3696276 ns          189 items_per_second=283.684M/s
BitmapReaderSum/8                             1532071 ns      1531893 ns          453 items_per_second=684.497M/s
BitmapReaderSum/64                             870072 ns       869960 ns          802 items_per_second=1.20531G/s
BitmapReaderSum/512                            811363 ns       811267 ns          854 items_per_second=1.29252G/s
BitmapReaderSum/1024                           803660 ns       803547 ns          872 items_per_second=1.30493G/s
BitmapReaderSumWithOffset/2                   3703315 ns      3702847 ns          187 items_per_second=283.181M/s
BitmapReaderSumWithOffset/8                   1551936 ns      1551728 ns          443 items_per_second=675.747M/s
BitmapReaderSumWithOffset/64                   893083 ns       892980 ns          784 items_per_second=1.17424G/s
BitmapReaderSumWithOffset/512                  812304 ns       812198 ns          861 items_per_second=1.29103G/s
BitmapReaderSumWithOffset/1024                 806786 ns       806691 ns          869 items_per_second=1.29985G/s

BinaryBitBlockCounterSum/2                    5629527 ns      5628860 ns          123 items_per_second=186.286M/s
BinaryBitBlockCounterSum/8                    2705594 ns      2705251 ns          259 items_per_second=387.608M/s
BinaryBitBlockCounterSum/64                   1577521 ns      1577334 ns          442 items_per_second=664.777M/s
BinaryBitBlockCounterSum/512                   602056 ns       601984 ns         1164 items_per_second=1.74187G/s
BinaryBitBlockCounterSum/1024                  436688 ns       436629 ns         1602 items_per_second=2.40153G/s
BinaryBitBlockCounterSumWithOffset/2          5772084 ns      5771353 ns          121 items_per_second=181.686M/s
BinaryBitBlockCounterSumWithOffset/8          2783490 ns      2783154 ns          251 items_per_second=376.758M/s
BinaryBitBlockCounterSumWithOffset/64         1611679 ns      1611491 ns          435 items_per_second=650.687M/s
BinaryBitBlockCounterSumWithOffset/512         766500 ns       766401 ns          914 items_per_second=1.36818G/s
BinaryBitBlockCounterSumWithOffset/1024        628334 ns       628256 ns         1112 items_per_second=1.66903G/s

BinaryBitmapReaderSum/2                       5747415 ns      5746682 ns          121 items_per_second=182.466M/s
BinaryBitmapReaderSum/8                       2939783 ns      2939433 ns          238 items_per_second=356.727M/s
BinaryBitmapReaderSum/64                      1721385 ns      1721173 ns          410 items_per_second=609.222M/s
BinaryBitmapReaderSum/512                     1551311 ns      1551124 ns          445 items_per_second=676.01M/s
BinaryBitmapReaderSum/1024                    1529969 ns      1529784 ns          457 items_per_second=685.441M/s
BinaryBitmapReaderSumWithOffset/2             5813192 ns      5812361 ns          127 items_per_second=180.404M/s
BinaryBitmapReaderSumWithOffset/8             3059190 ns      3058824 ns          224 items_per_second=342.804M/s
BinaryBitmapReaderSumWithOffset/64            1705614 ns      1705393 ns          410 items_per_second=614.859M/s
BinaryBitmapReaderSumWithOffset/512           1552966 ns      1552768 ns          450 items_per_second=675.295M/s
BinaryBitmapReaderSumWithOffset/1024          1533101 ns      1532922 ns          455 items_per_second=684.037M/s

So even in the (presumably) least favorable case, BitBlockCounter seems competitive.

wesm commented Jun 8, 2020

> So the "/8" benchmarks mean that only 12.5% of the values are nulls, right?

Yes that's right.

I'm sort of speculating that BitmapReader may not be really beneficial outside of narrow microbenchmarks. It's about 10-15% faster than naive use of BitUtil::GetBit, but here we get the same or better performance using GetBit + BitBlockCounter (even when we almost never have all-set blocks), perhaps because there is some word prefetching benefit?
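
The "GetBit + BitBlockCounter" pattern can be sketched roughly as follows: popcount each 64-bit word of the validity bitmap, take a check-free fast path when the word is all set, skip it when it is all clear, and fall back to per-bit tests otherwise. This is a minimal standalone illustration under assumptions (bit offset 0, hypothetical names `SumValid`, `PopCount64`, `GetBit`), not the benchmark code from this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Portable popcount: clears the lowest set bit until none remain.
inline int64_t PopCount64(uint64_t word) {
  int64_t count = 0;
  while (word != 0) {
    word &= word - 1;
    ++count;
  }
  return count;
}

// Least-significant-bit-first numbering within each byte.
inline bool GetBit(const uint8_t* bits, int64_t i) {
  return ((bits[i / 8] >> (i % 8)) & 1) != 0;
}

// Sum the values whose validity bit is set.
// Assumes the bitmap starts at bit offset 0 and covers at least `length` bits.
int64_t SumValid(const int64_t* values, const uint8_t* validity, int64_t length) {
  assert(length >= 0);
  int64_t sum = 0;
  int64_t pos = 0;
  for (; length - pos >= 64; pos += 64) {
    uint64_t word = 0;
    std::memcpy(&word, validity + pos / 8, sizeof(word));
    const int64_t set_bits = PopCount64(word);
    if (set_bits == 64) {
      // Fast path: no nulls in this block, so no per-value checks.
      for (int64_t i = 0; i < 64; ++i) sum += values[pos + i];
    } else if (set_bits > 0) {
      // Mixed block: test each bit individually.
      for (int64_t i = 0; i < 64; ++i) {
        if (GetBit(validity, pos + i)) sum += values[pos + i];
      }
    }
    // set_bits == 0: every value is null, nothing to add.
  }
  // Trailing values that do not fill a whole word.
  for (; pos < length; ++pos) {
    if (GetBit(validity, pos)) sum += values[pos];
  }
  return sum;
}
```

Even when few blocks are fully set, the per-block popcount costs one word load and one count per 64 values, so the fallback path stays close to a plain GetBit loop.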

@pitrou pitrou left a comment

Very nice performance work.

pitrou commented Jun 8, 2020

> It's about 10-15% faster than naive use of BitUtil::GetBit, but here we get the same or better performance using GetBit + BitBlockCounter (even when we almost never have all-set blocks), perhaps because there is some word prefetching benefit?

I'm not sure about the underlying cause, but one possible explanation would be fewer branches and/or better branch prediction.

wesm commented Jun 9, 2020

For posterity, here are the current benchmarks on my machine (i9-9960X):

https://gist.github.com/wesm/b54636fb871717df2f8e50559a07b787

Merging this once the build passes.
