vectorize 'auto' long decoding #11004
Conversation
Also, add many more tests.
   first = false;
 } else {
-  curByte = (byte) ((curByte << 4) | ((value >> (numBytes << 3)) & 0xF));
+  curByte = (byte) ((curByte << 4) | ((value >>> (numBytes << 3)) & 0xF));
Was this a bug fix? If so: it's on the write side; does that mean there might be bad segments out there, or is there some reason that this line wouldn't have affected any already-written data that people might have? (Maybe negative numbers were never fed to this method.)
oops, this was an accidental change, will revert
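For context on the two operators in that diff: `>>` sign-extends while `>>>` zero-fills. Because the result here is masked with 0xF, both operators actually produce the same nibble for the shift amounts this method uses (multiples of 8 up to 56), which is consistent with the change being behavior-neutral. A minimal sketch:

```java
public class ShiftDemo
{
  public static void main(String[] args)
  {
    long value = -1L;
    // Arithmetic shift fills vacated bits with the sign bit; logical shift fills with zeros.
    System.out.println(Long.toHexString(value >> 8));   // ffffffffffffffff
    System.out.println(Long.toHexString(value >>> 8));  // ffffffffffffff
    // After masking to a nibble, both agree for shift amounts <= 60,
    // which covers the multiples of 8 used in this write method.
    for (int numBytes = 0; numBytes <= 7; numBytes++) {
      int shift = numBytes << 3;
      System.out.println(((value >> shift) & 0xF) == ((value >>> shift) & 0xF)); // true
    }
  }
}
```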
 * Unpack a non-contiguous vector of long values at the specified indexes and adjust them by the supplied delta base
 * value.
 */
default int getDelta(long[] out, int outPosition, int[] indexes, int length, int indexOffset, int limit, long base)
Do you have evidence that the getDelta and getTable methods are helpful? (vs. the alternative: first calling a regular bulk get method, then applying the delta or table adjustment in a loop over the returned arrays)
They complexify the code quite a bit, so we should only include them if they are meaningfully better performance-wise.
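For reference, a hypothetical sketch of the two shapes being compared here (the names and the pre-unpacked input array are illustrative only, not the actual LongDeserializer interface):

```java
import java.util.Arrays;

public class DeltaShapes
{
  // Pushdown shape: apply the delta base while filling the output, in one pass.
  static void getDeltaFused(long[] out, int outPosition, long[] unpacked, int startIndex, int length, long base)
  {
    for (int i = 0; i < length; i++) {
      out[outPosition + i] = base + unpacked[startIndex + i];
    }
  }

  // Alternative shape: a regular bulk get, then a second loop for the adjustment.
  static void getThenAdjust(long[] out, int outPosition, long[] unpacked, int startIndex, int length, long base)
  {
    System.arraycopy(unpacked, startIndex, out, outPosition, length);
    for (int i = 0; i < length; i++) {
      out[outPosition + i] += base;
    }
  }

  public static void main(String[] args)
  {
    long[] unpacked = {0, 1, 2, 3};
    long[] a = new long[4];
    long[] b = new long[4];
    getDeltaFused(a, 0, unpacked, 0, 4, 100);
    getThenAdjust(b, 0, unpacked, 0, 4, 100);
    System.out.println(Arrays.equals(a, b)); // true
  }
}
```

The question is whether fusing the two loops saves enough work over the extra iteration to justify the larger interface.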
I haven't measured with these exact get methods as you suggest (an earlier test, before they were filled out, did seem to show improvement, which is why I went down this path: cutting out the extra vector iteration seemed to be worth the math). I will try to run the numbers this evening to compare.
I did the first 8 bit widths so far; the results without pushdown are 1-2 millis slower.
without pushdown:
Benchmark (distribution) (encoding) (filteredRowCountPercentage) (rows) (zeroProbability) Mode Cnt Score Error Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-1 lz4-auto 1.0 5000000 0.0 avgt 5 20409.851 ± 998.810 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-2 lz4-auto 1.0 5000000 0.0 avgt 5 18935.144 ± 199.459 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-4 lz4-auto 1.0 5000000 0.0 avgt 5 18668.448 ± 538.985 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-8 lz4-auto 1.0 5000000 0.0 avgt 5 19010.981 ± 695.243 us/op
compared to with pushdown:
Benchmark (distribution) (encoding) (filteredRowCountPercentage) (rows) (zeroProbability) Mode Cnt Score Error Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-1 lz4-auto 1.0 5000000 0.0 avgt 5 17892.362 ± 102.776 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-2 lz4-auto 1.0 5000000 0.0 avgt 5 17796.103 ± 417.847 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-4 lz4-auto 1.0 5000000 0.0 avgt 5 17879.496 ± 237.066 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-8 lz4-auto 1.0 5000000 0.0 avgt 5 17508.260 ± 560.856 us/op
I'll run the rest, but based on the results so far, and since only the first 4 deserializers implement getTable, I would be in favor of leaving the pushdown in place; it isn't that much extra complexity.
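For a rough sense of the magnitude, the uniform-1 rows quoted above work out to roughly a 12% improvement with pushdown:

```java
public class SpeedupCalc
{
  public static void main(String[] args)
  {
    // Scores quoted from the uniform-1 benchmark rows above (us/op).
    double withoutPushdown = 20409.851;
    double withPushdown = 17892.362;
    double pct = (withoutPushdown - withPushdown) / withoutPushdown * 100;
    System.out.printf("%.1f%%%n", pct); // 12.3%
  }
}
```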
without pushdown:
Benchmark (distribution) (encoding) (filteredRowCountPercentage) (rows) (zeroProbability) Mode Cnt Score Error Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-12 lz4-auto 1.0 5000000 0.0 avgt 5 20574.724 ± 3674.509 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-16 lz4-auto 1.0 5000000 0.0 avgt 5 20893.434 ± 2894.104 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-20 lz4-auto 1.0 5000000 0.0 avgt 5 19857.499 ± 1926.944 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-24 lz4-auto 1.0 5000000 0.0 avgt 5 22194.340 ± 2624.983 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-32 lz4-auto 1.0 5000000 0.0 avgt 5 18321.336 ± 465.633 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-40 lz4-auto 1.0 5000000 0.0 avgt 5 23252.329 ± 341.846 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-48 lz4-auto 1.0 5000000 0.0 avgt 5 25273.632 ± 1414.751 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-56 lz4-auto 1.0 5000000 0.0 avgt 5 26429.779 ± 2649.011 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-64 lz4-auto 1.0 5000000 0.0 avgt 5 21943.211 ± 2124.446 us/op
compare with pushdown:
Benchmark (distribution) (encoding) (filteredRowCountPercentage) (rows) (zeroProbability) Mode Cnt Score Error Units
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-12 lz4-auto 1.0 5000000 0.0 avgt 5 18272.440 ± 71.751 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-16 lz4-auto 1.0 5000000 0.0 avgt 5 19042.292 ± 595.685 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-20 lz4-auto 1.0 5000000 0.0 avgt 5 18782.746 ± 248.738 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-24 lz4-auto 1.0 5000000 0.0 avgt 5 19048.354 ± 160.025 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-32 lz4-auto 1.0 5000000 0.0 avgt 5 17984.778 ± 633.691 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-40 lz4-auto 1.0 5000000 0.0 avgt 5 22070.007 ± 166.035 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-48 lz4-auto 1.0 5000000 0.0 avgt 5 22052.517 ± 2168.763 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-56 lz4-auto 1.0 5000000 0.0 avgt 5 25001.739 ± 259.184 us/op
ColumnarLongsSelectRowsFromGeneratorBenchmark.selectRowsVectorized uniform-64 lz4-auto 1.0 5000000 0.0 avgt 5 20325.369 ± 60.724 us/op
Thanks for running the tests. It seems worth it to keep it this way.
 * Unpack a contiguous vector of long values at the specified start index of length and adjust them by the supplied
 * delta base value.
 */
default void getDelta(long[] out, int outPosition, int startIndex, int length, long base)
Are the default implementations ever used? If not, we could remove them.
I think the default getDelta methods could be removed, and the default getTable methods could probably throw an UnsupportedOperationException for sizes that don't support table encoding (or we could push that to the implementations so there are no defaults).
I was wrong: the default contiguous getDelta methods are not used, but the int[] indexes versions are used by all un-aligned decoders (1, 2, 4, 12, 20, 24, 40, 48, 56).
import java.util.stream.Collectors;

@RunWith(Enclosed.class)
public class VSizeLongSerdeTest
The change in Mult4Ser suggests that we care about handling negative numbers, but this test class doesn't exercise negative numbers very much. (I think it only tests Long.MIN_VALUE, in testEveryPowerOfTwo.)
If negative numbers matter, we should extend the test cases in this file to cover them better. I'd suggest adding tests to EveryLittleBitTest that are similar to testEveryPowerOfTwo and testEveryPowerOfTwoMinusOne, but have the sign bit set (i.e. bitwise or with Long.MIN_VALUE).
If negative numbers aren't important, I'd suggest blocking them on the write side, i.e. have all the LongSerializers throw errors if they are fed negative numbers.
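A hypothetical sketch of the suggested value generation (the serde round-trip itself is elided since it depends on the test harness): powers of two with the sign bit set via bitwise or with Long.MIN_VALUE.

```java
public class NegativePowersOfTwo
{
  public static void main(String[] args)
  {
    // Powers of two with the sign bit set, per the suggestion above.
    long[] values = new long[63];
    for (int i = 0; i < 63; i++) {
      values[i] = (1L << i) | Long.MIN_VALUE;
    }
    System.out.println(values[0] == Long.MIN_VALUE + 1); // true
    // Every generated value is negative, exercising the sign bit on the write path.
    for (long v : values) {
      assert v < 0;
    }
  }
}
```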
I don't think it's necessary, since that change was accidental, and I think negative numbers shouldn't make it here because both delta and table encoding appear to make all numbers positive.
In that case, IMO it'd be good to add checks to the serializers to make sure negative numbers aren't provided.
Looking closer, I don't think it is necessary to add the checks based on how IntermediateColumnarLongsSerializer works.
For delta encoding, the delta is computed with LongMath.checkedSubtract which will ensure delta is a positive number and the checks ensure it won't be used if delta is Long.MAX_VALUE, or it overflowed because the range between values was too big, setting delta to -1. These checks make it so the value in the constructor of the writer fed to VSizeLongSerde.getBitsForMax will be appropriate. The minVal is used as the base, and the writer subtracts the base from every value while serializing so every number given to the serializer will be between 0 and delta.
For table encoding, since the values are replaced with their corresponding index into the table array, they should also always be 0 or positive.
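The invariant described above can be sketched as follows (Math.subtractExact stands in here for Guava's LongMath.checkedSubtract; both throw on overflow):

```java
public class DeltaInvariant
{
  public static void main(String[] args)
  {
    long minVal = -100;
    long maxVal = 50;
    // Throws ArithmeticException if the range overflows a long, in which
    // case delta encoding would not be chosen for the column.
    long delta = Math.subtractExact(maxVal, minVal); // 150
    // The writer subtracts the base (minVal) from every value, so every
    // number handed to the serializer lands in [0, delta].
    for (long value = minVal; value <= maxVal; value += 10) {
      long serialized = value - minVal;
      System.out.println(serialized >= 0 && serialized <= delta); // true
    }
  }
}
```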
Is it worth the overhead to make LongSerializer.write implementations check this for every value?
It doesn't really matter much if the methods are being called correctly today. The purpose of the checks is in case someone starts calling these methods incorrectly in the future.
I would imagine the overhead is minimal given that LongSerializers are doing a single write per method call. (A precondition check shouldn't add much on top of the method call overhead that's already there, especially since the branch would be 100% predictable.)
If it turns out that the overhead is worth worrying about, then I'd add an assert instead of a regular precondition check. The assert would be skipped in production but will run during tests.
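The two flavors of check being discussed, sketched outside the actual serializer classes (names are illustrative):

```java
public class WriteChecks
{
  // Always-on precondition: runs in production too.
  static void writeChecked(long value)
  {
    if (value < 0) {
      throw new IllegalArgumentException("Expected non-negative value, got " + value);
    }
    // ... write the value ...
  }

  // Assert variant: skipped in production, active with -ea (e.g. during tests).
  static void writeAsserted(long value)
  {
    assert value >= 0 : "Expected non-negative value, got " + value;
    // ... write the value ...
  }

  public static void main(String[] args)
  {
    writeChecked(42);
    writeAsserted(42);
    try {
      writeChecked(-1);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```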
It's not too bad; the low end doesn't seem very different at all.
Before adding Preconditions.checkArgument(value >= 0); to every write:
Benchmark (distribution) (encoding) (rows) (zeroProbability) Mode Cnt Score Error Units
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-1 lz4-auto 5000000 0.0 avgt 2 97.509 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-2 lz4-auto 5000000 0.0 avgt 2 148.329 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-4 lz4-auto 5000000 0.0 avgt 2 171.316 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-8 lz4-auto 5000000 0.0 avgt 2 242.792 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-12 lz4-auto 5000000 0.0 avgt 2 249.412 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-16 lz4-auto 5000000 0.0 avgt 2 254.304 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-20 lz4-auto 5000000 0.0 avgt 2 309.736 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-24 lz4-auto 5000000 0.0 avgt 2 359.252 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-32 lz4-auto 5000000 0.0 avgt 2 416.235 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-40 lz4-auto 5000000 0.0 avgt 2 492.306 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-48 lz4-auto 5000000 0.0 avgt 2 611.416 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-56 lz4-auto 5000000 0.0 avgt 2 734.503 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-64 lz4-auto 5000000 0.0 avgt 2 669.983 ms/op
after:
Benchmark (distribution) (encoding) (rows) (zeroProbability) Mode Cnt Score Error Units
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-1 lz4-auto 5000000 0.0 avgt 2 100.281 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-2 lz4-auto 5000000 0.0 avgt 2 148.947 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-4 lz4-auto 5000000 0.0 avgt 2 178.900 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-8 lz4-auto 5000000 0.0 avgt 2 264.585 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-12 lz4-auto 5000000 0.0 avgt 2 262.852 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-16 lz4-auto 5000000 0.0 avgt 2 265.252 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-20 lz4-auto 5000000 0.0 avgt 2 341.000 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-24 lz4-auto 5000000 0.0 avgt 2 368.859 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-32 lz4-auto 5000000 0.0 avgt 2 456.571 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-40 lz4-auto 5000000 0.0 avgt 2 635.758 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-48 lz4-auto 5000000 0.0 avgt 2 831.513 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-56 lz4-auto 5000000 0.0 avgt 2 975.933 ms/op
ColumnarLongsEncodeDataFromGeneratorBenchmark.encodeColumn uniform-64 lz4-auto 5000000 0.0 avgt 2 727.638 ms/op
oops, the precondition broke testEveryPowerOfTwo for the 64-bit case, which overflows; though thinking about it, I guess it wouldn't matter if a 64-bit value is negative because it has all the bits. I've made an exception for this case to make the test pass again.
thanks for the review @gianm 🤘
Regression introduced in apache#11004 due to overzealous optimization. Even though we replaced stateful usage of ByteBuffer with stateless usage of Memory, we still need to create a new object on "duplicate" due to semantics of setBuffer.
This PR fixes an issue when using 'auto' encoded LONG typed columns with the 'vectorized' query engine. These columns use a delta-based bit-packing mechanism, and errors in the vectorized reader would cause it to incorrectly read column values for some bit sizes (1 through 32 bits). This is a regression caused by #11004, which added the optimized readers to improve performance, so it impacts Druid versions 0.22.0+. While writing the test I finally got sad enough about IndexSpec not having a builder, so I made one and switched all the things to use it. Apologies for the noise in this bug fix PR; the only real changes are in VSizeLongSerde and the tests that have been modified to cover the buggy behavior, VSizeLongSerdeTest and ExpressionVectorSelectorsTest. Everything else is just cleanup of IndexSpec usage.
Description
This PR specializes LongDeserializer implementations with getDelta and getTable methods to push down bit unpacking, delta-encoding adjustment, and table lookups for vectors of data as far as possible, working more efficiently with vectorized query engines and primarily focusing on contiguous reads.

It works by unrolling value reads to line up with ByteBuffer get methods, eliminating overlapping reads where possible and reading blocks of 8 values at a time for un-aligned widths (1, 2, 4, 12, 20, 24, 40, 48, 56). ByteBuffer-aligned bit-packing widths (8, 16, 32, 64) actually performed worse with this same unrolling, so they instead use a traditional for loop with the aligned get methods.

Most of the improvement is on the small end, since it has the most redundant/overlapping memory accesses, but the improvement there is pretty decent.
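A hypothetical sketch of the aligned case described above (names are illustrative, not the actual VSizeLongSerde code): for byte-aligned widths, a plain loop over the corresponding ByteBuffer get method is already efficient, e.g. for 16-bit values:

```java
import java.nio.ByteBuffer;

public class AlignedUnpack
{
  // Read `length` 16-bit packed values starting at value index `startIndex`.
  static void unpack16(long[] out, int outPosition, ByteBuffer buffer, int startIndex, int length)
  {
    for (int i = 0; i < length; i++) {
      // Each value is exactly two bytes; mask to treat it as unsigned.
      out[outPosition + i] = buffer.getShort((startIndex + i) << 1) & 0xFFFFL;
    }
  }

  public static void main(String[] args)
  {
    ByteBuffer buffer = ByteBuffer.wrap(new byte[]{0x00, 0x01, (byte) 0xFF, (byte) 0xFF});
    long[] out = new long[2];
    unpack16(out, 0, buffer, 0, 2);
    System.out.println(out[0]); // 1
    System.out.println(out[1]); // 65535
  }
}
```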
Full column scans before/after on uniform distribution columns of varying bits to cover the entire set of value decoders
before:
after:
Full column scans on other value distribution before/after comparison
before:
after:
Full benchmarks on column scan value selects with simulated filters (up to full scan)
For the top line graph, the x axis is the percent of rows selected by column scan (offset analog) and goes up to 1.0/contiguous scan. The y axis is the amount of time it took to select that many values. The numbers are kind of noisy because I didn't spend the multiple days necessary to do it properly, but it gives a decent idea. Nearly all of the improvement is on the full scan end of the spectrum (last datapoint), because the underlying BitmapVectorOffset will never call the contiguous vectorized get methods (which I will fix in a follow-up PR), and the contiguous gets were the primary area of improvement in this PR. The bottom bar chart is the size of the column in bytes as it would be stored in a segment (encoded and/or compressed).

In a follow-up PR I will probably argue for making this the default long encoding since it does pretty well with the vectorized engine, compared to lz4 longs.
Unpacking
I uh, drew pictures to help me get the shifts and masking correct, so I'll add them here in case they are any help to reviewers. Imagine each diagram for the bit width is the number of values to unpack (8) per loop, where each color is a separate value, and each box is 4 bits.
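In the same spirit as the diagrams, here is a hypothetical sketch of the unrolled idea for the 4-bit width (illustrative only, not the actual VSizeLongSerde code): eight 4-bit values fit in one 32-bit word, so a single getInt replaces eight overlapping byte reads, and the shifts/masks carve out each nibble.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class Unrolled4Bit
{
  // Unpack 8 consecutive 4-bit values starting at value index `startIndex`
  // (assumed to be a multiple of 8) with a single 32-bit read.
  static void unpack8x4(long[] out, int outPosition, ByteBuffer buffer, int startIndex)
  {
    int word = buffer.getInt(startIndex >> 1); // 2 values per byte
    for (int i = 0; i < 8; i++) {
      // Value i occupies bits [28 - 4i, 31 - 4i] of the big-endian word.
      out[outPosition + i] = (word >>> (28 - (i << 2))) & 0xF;
    }
  }

  public static void main(String[] args)
  {
    ByteBuffer buffer = ByteBuffer.wrap(new byte[]{0x12, 0x34, 0x56, 0x78});
    long[] out = new long[8];
    unpack8x4(out, 0, buffer, 0);
    System.out.println(Arrays.toString(out)); // [1, 2, 3, 4, 5, 6, 7, 8]
  }
}
```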
This PR has: