generic block compressed complex columns #16863
Conversation
changes:
* Adds new `CompressedComplexColumn`, `CompressedComplexColumnSerializer`, `CompressedComplexColumnSupplier` based on the `CompressedVariableSizedBlobColumn` used by JSON columns.
* Adds `IndexSpec.complexMetricCompression`, which can be used to specify compression for the generic compressed complex column. Defaults to uncompressed because compressed columns are not backwards compatible.
* Adds a new definition of `ComplexMetricSerde.getSerializer` which accepts an `IndexSpec` argument when creating a serializer. The old signature has been marked `@Deprecated` and has a default implementation that returns `null`, but if implemented to return a non-null value it will be used by the default implementation of the new version. The default implementation of the new method will use a `CompressedComplexColumnSerializer` if `IndexSpec.complexMetricCompression` is not null/none/uncompressed, and a `LargeColumnSupportedComplexColumnSerializer` otherwise.
* Consolidates all duplicate generic implementations of `ComplexMetricSerde.getSerializer` and `ComplexMetricSerde.deserializeColumn` into default implementations on `ComplexMetricSerde` instead of being copied all over the place. The default implementation of `deserializeColumn` checks whether the first byte indicates that the new compression was used, and otherwise uses the `GenericIndexed`-based supplier.
* Complex columns with custom serializers/deserializers are unaffected and may continue doing whatever it is they do, either with specialized compression or otherwise; this new functionality just provides generic implementations built around `ObjectStrategy`.
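The serializer selection described above (deprecated method consulted first for compatibility, then the compression strategy decides) can be sketched with stand-in types. This is a minimal sketch, not Druid's actual API: `chooseSerializer` and the returned names are illustrative placeholders.

```java
// Stand-in sketch of the default getSerializer dispatch; not Druid's real types.
public class SerializerDispatchSketch {
    enum CompressionStrategy { NONE, UNCOMPRESSED, LZ4, ZSTD }

    // The deprecated no-IndexSpec method is tried first; a non-null result
    // wins, preserving backwards compatibility for existing implementations.
    static String chooseSerializer(String deprecatedResult, CompressionStrategy strategy) {
        if (deprecatedResult != null) {
            return deprecatedResult; // backwards compatibility path
        }
        if (strategy == null
            || strategy == CompressionStrategy.NONE
            || strategy == CompressionStrategy.UNCOMPRESSED) {
            return "LargeColumnSupportedComplexColumnSerializer";
        }
        return "CompressedComplexColumnSerializer";
    }

    public static void main(String[] args) {
        System.out.println(chooseSerializer(null, CompressionStrategy.LZ4));
        System.out.println(chooseSerializer(null, null));
        System.out.println(chooseSerializer("CustomSerializer", CompressionStrategy.LZ4));
    }
}
```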
```java
  ColumnConfig columnConfig
)
{
  deserializeColumn(buffer, builder);
```

Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation
```java
 */
@Nullable
@Deprecated
public GenericColumnSerializer getSerializer(SegmentWriteOutMedium segmentWriteOutMedium, String column)
```

Check notice (Code scanning / CodeQL): Useless parameter
```java
{
  return ComplexColumnSerializer.create(segmentWriteOutMedium, column, this.getObjectStrategy());
  // backwards compatibility
  final GenericColumnSerializer serializer = getSerializer(segmentWriteOutMedium, column);
```

Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation
```java
return LargeColumnSupportedComplexColumnSerializer.create(
    segmentWriteOutMedium,
    column,
    getObjectStrategy()
```

Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation
```java
    segmentWriteOutMedium,
    column,
    indexSpec,
    getObjectStrategy()
```

Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation
* add `ObjectStrategy.readRetainsBufferReference` so `CompressedComplexColumn` only copies on read if required
* add `copyValueOnRead` flag down to `CompressedBlockReader` to avoid buffer duplicate if we are going to copy anyway
```java
/**
 * Whether the {@link #fromByteBuffer(ByteBuffer, int)}, {@link #fromByteBufferWithSize(ByteBuffer)}, and
 * {@link #fromByteBufferSafe(ByteBuffer, int)} methods return an object that may retain a reference to the provided
 * {@link ByteBuffer}. If a reference is sometimes retained, this method returns true. It returns false if, and only
```

Should clarify: does this mean "retains a reference to the specific ByteBuffer object" or "retains a reference to the same underlying memory"? i.e., if an ObjectStrategy calls duplicate() on the buf and retains the duplicate, should it return true or false from this method?

ah, yeah, I suppose it does need to clarify that it means it retains the same memory; will update.
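The distinction discussed here hinges on `ByteBuffer.duplicate()` sharing the underlying memory with its source even though it is a distinct object, so a strategy that retains a duplicate should still report that it retains a buffer reference. A small self-contained demonstration (stdlib only, the method names here are illustrative):

```java
import java.nio.ByteBuffer;

public class BufferRetentionDemo {
    // duplicate() creates a new ByteBuffer object, but it shares the same
    // underlying memory: writes through the original are visible through it.
    static byte readThroughDuplicate() {
        ByteBuffer original = ByteBuffer.allocate(4);
        ByteBuffer dupe = original.duplicate();
        original.put(0, (byte) 42); // mutate via the original...
        return dupe.get(0);         // ...and observe it via the duplicate
    }

    // Copying the bytes out detaches from the source buffer's memory.
    static byte readThroughCopy() {
        ByteBuffer original = ByteBuffer.allocate(4);
        original.put(0, (byte) 42);
        ByteBuffer copy = ByteBuffer.allocate(4);
        copy.put(0, original.get(0));
        original.put(0, (byte) 7);  // later mutation of the original...
        return copy.get(0);         // ...does not affect the copy
    }

    public static void main(String[] args) {
        System.out.println(readThroughDuplicate()); // 42: memory is shared
        System.out.println(readThroughCopy());      // 42: copy is detached
    }
}
```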
```java
final ByteBuffer dupe = decompressedDataBuffer.duplicate().order(byteOrder);
dupe.position(startBlockOffset).limit(startBlockOffset + size);
return dupe.slice().order(byteOrder);
// sweet, same buffer, we can return the buffer directly with position and limit set
```

this comment appears to no longer be 100% accurate
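One reason the snippet above re-applies `order(byteOrder)` after both `duplicate()` and `slice()`: newly created byte buffer views always start out big-endian, regardless of the source buffer's order. A sketch illustrating that gotcha (method names are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SliceOrderDemo {
    // slice() (like duplicate()) does not inherit the source's byte order,
    // so reading without re-applying it can silently flip endianness.
    static int readSliceWithoutOrder(ByteBuffer src) {
        return src.slice().getInt(0);
    }

    static int readSliceWithOrder(ByteBuffer src) {
        return src.slice().order(src.order()).getInt(0);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(0, 1); // stored as bytes 01 00 00 00 in little-endian
        System.out.println(readSliceWithoutOrder(buf)); // 16777216 (read big-endian)
        System.out.println(readSliceWithOrder(buf));    // 1
    }
}
```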
```java
// otherwise, use compressed or generic indexed based serializer
CompressionStrategy strategy = indexSpec.getComplexMetricCompression();
if (strategy == null || CompressionStrategy.NONE == strategy || CompressionStrategy.UNCOMPRESSED == strategy) {
  return LargeColumnSupportedComplexColumnSerializer.create(
```

The old code had a default of ComplexColumnSerializer. Why change that? (Is it strictly better? Are there compatibility concerns?)

`LargeColumnSupportedComplexColumnSerializer` is basically identical to `ComplexColumnSerializer`; they both use `GenericIndexedWriter` with a filename of `StringUtils.format("%s.complex_column", filenameBase)`. The main difference is that `LargeColumnSupportedComplexColumnSerializer` passes the `FileSmoosher` through to the writer, allowing it to write v2 `GenericIndexed`, which has multi-file support. So I think there shouldn't be any compatibility concerns on the read side, since both of these just use `GenericIndexed.read`.
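On the read side, the default `deserializeColumn` described in the PR summary dispatches on the first byte of the column data to decide between the new compressed supplier and the `GenericIndexed`-based one. A stand-in sketch of that pattern; the marker byte value and names here are illustrative, not the actual format constants:

```java
import java.nio.ByteBuffer;

public class DeserializeDispatchSketch {
    // Illustrative marker byte; the actual value written by the new
    // compressed format may differ.
    static final byte COMPRESSED_V0 = 0x01;

    // Peek at the first byte without consuming the buffer, then pick the
    // supplier that knows how to read this layout.
    static String chooseSupplier(ByteBuffer buffer) {
        byte version = buffer.get(buffer.position());
        return version == COMPRESSED_V0
               ? "CompressedComplexColumnSupplier"
               : "GenericIndexed-based supplier";
    }

    public static void main(String[] args) {
        System.out.println(chooseSupplier(ByteBuffer.wrap(new byte[]{0x01})));
        System.out.println(chooseSupplier(ByteBuffer.wrap(new byte[]{0x00})));
    }
}
```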
```java
{
  if (!closedForWrite) {
    closedForWrite = true;
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
```

yeah, noticed while poking around other stuff.
```java
channel.write(ByteBuffer.wrap(new byte[]{V0}));
channel.write(ByteBuffer.wrap(metadataBytes));

NestedCommonFormatColumnSerializer.writeInternal(smoosher, writer, name, FILE_NAME);
```

perhaps move that `writeInternal` to a more neutral helper class? It seems odd for the complex column serializer to call into an internal method of the nested column serializer. I mean, it's fine, just a little odd.

yeah, fair. I would like `NestedCommonFormatColumnSerializer` itself to be a more neutral thing, since I think the way it splits parts into separate smoosh files is advantageous if we implement things like partial segment download. I'd also like to make some adjustments to the smoosh metadata format to eliminate the chance of name collisions, though I'm definitely not going to do any of that in this PR.
gianm left a comment

LGTM after the latest changes.
Description
Example wikipedia with users as DS HLL, added, deleted, delta, deltaBucket, commentLength as DS quantiles:

Example wikipedia with users and comment as DS HLL, page as DS Theta:

The performance measurements with compression enabled were not as much slower than uncompressed as I was expecting, for the following queries:
non-vectorized:
vectorized:
with segment sizes:
The benchmarks are not present in this PR; I've done some pretty heavy refactoring of the SQL benchmarks, so I will add them in a follow-up PR.
changes:
* Adds new `CompressedComplexColumn`, `CompressedComplexColumnSerializer`, `CompressedComplexColumnSupplier` based on the `CompressedVariableSizedBlobColumn` used by JSON columns.
* Adds `IndexSpec.complexMetricCompression`, which can be used to specify compression for the generic compressed complex column. Defaults to uncompressed because compressed columns are not backwards compatible.
* Adds a new definition of `ComplexMetricSerde.getSerializer` which accepts an `IndexSpec` argument when creating a serializer. The old signature has been marked `@Deprecated` and has a default implementation that returns `null`, but if implemented to return a non-null value it will be used by the default implementation of the new version. The default implementation of the new method will use a `CompressedComplexColumnSerializer` if `IndexSpec.complexMetricCompression` is not null/none/uncompressed, and a `LargeColumnSupportedComplexColumnSerializer` otherwise.
* Consolidates duplicate generic implementations of `ComplexMetricSerde.getSerializer` and `ComplexMetricSerde.deserializeColumn` into default implementations on `ComplexMetricSerde`. The default implementation of `deserializeColumn` checks whether the first byte indicates that the new compression was used, and otherwise uses the `GenericIndexed`-based supplier.
* Complex columns with custom serializers/deserializers are unaffected; this just provides generic implementations built around `ObjectStrategy`. This should not preclude further specializing specific complex types in the future; it is just a generic base way to have compression to save some space.

Release note
Compression is now available for all "complex" metric columns which do not have specialized implementations, through a new `IndexSpec` option, `complexMetricCompression`. It defaults to uncompressed for backwards compatibility, but can be configured to any compression strategy (lz4, zstd, etc.). This works for most complex columns, except for compressed-big-decimal and the columns stored by first/last aggregators. Note that enabling compression is not backwards compatible with Druid versions older than 31, so only enable this functionality once you are certain there is no need to roll back to an older Druid version.
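As a sketch of how this might be enabled, assuming the usual placement of `indexSpec` inside a task's `tuningConfig` (consult the Druid documentation for the exact location in your ingestion spec):

```json
{
  "tuningConfig": {
    "indexSpec": {
      "complexMetricCompression": "lz4"
    }
  }
}
```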