Conversation
Changes:

* Introduce `ColumnFormat` to separate physical storage format from logical type. `ColumnFormat` is now used instead of `ColumnCapabilities` to get column handlers for segment creation.
* Introduce a new 'standard type' indexer, merger, and family of serializers, which is the next logical iteration of the nested column work. Essentially this is an automatic type column indexer that produces the most appropriate column for the given inputs, making either `STRING`, `ARRAY<STRING>`, `LONG`, `ARRAY<LONG>`, `DOUBLE`, `ARRAY<DOUBLE>`, or `COMPLEX<json>`.
* Revert `NestedDataColumnIndexer`, `NestedDataColumnMerger`, and `NestedDataColumnSerializer` to their pre-apache#13803 behavior (v4) for backwards compatibility.
* Fix a bug in `RoaringBitmapSerdeFactory` if anything actually ever wrote out an empty bitmap using `toBytes` and then later tried to read it (the nerve!).
```java
if (o instanceof Object[]) {
  Object[] array = (Object[]) o;
  if (elementNumber < array.length) {
    return array[elementNumber];
```
Check failure (Code scanning / CodeQL): Improper validation of user-provided array index
```diff
-    @JsonSubTypes.Type(name = NestedDataComplexTypeSerde.TYPE_NAME, value = NestedDataDimensionSchema.class)
+    @JsonSubTypes.Type(name = NestedDataComplexTypeSerde.TYPE_NAME, value = NestedDataDimensionSchema.class),
+    @JsonSubTypes.Type(name = StandardTypeColumnSchema.TYPE, value = StandardTypeColumnSchema.class),
+    @JsonSubTypes.Type(name = "auto", value = StandardTypeColumnSchema.class)
```
Why do we need this added as well? Shouldn't anything that is using this be able to use StandardTypeColumnSchema.TYPE when it persists such that we don't need to add 2 names for the same thing?
```java
default void mergeNestedFields(SortedMap<String, FieldTypeInfo.MutableTypeSet> mergedFields)
{
  mergedFields.put(
      NestedPathFinder.JSON_PATH_ROOT,
      new FieldTypeInfo.MutableTypeSet().add(getColumnCapabilities().toColumnType())
  );
}
```
This feels very nested-column-specific, but it's a change on the DimensionIndexer interface. Does it really need to be here? Can't DimensionIndexer instances make sure that they are all the same type and then use concrete methods, instead of leaking something like this on the interface?
```java
return new CapabilitiesBasedFormat(
    ColumnCapabilitiesImpl.snapshot(
        getColumnCapabilities(),
        CapabilitiesBasedFormat.DIMENSION_CAPABILITY_MERGE_LOGIC
```
A bit of a nit, but it seems really weird to have the CapabilitiesBasedFormat need to be given logic from itself? Maybe create a static CapabilitiesBasedFormat.snapshot(ColumnCapabilities) that does this on its own and keeps the implementation details local and private?
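A minimal sketch of what such a static factory might look like. The types here are simplified stand-ins, not Druid's actual `ColumnCapabilities`/`CapabilitiesBasedFormat` classes, and the string constant is invented for illustration; the point is only that the merge-logic constant stays a private implementation detail.

```java
// Hedged sketch of the suggestion above: a static CapabilitiesBasedFormat.snapshot()
// so callers no longer pass the class's own merge-logic constant back into it.
class CapabilitiesBasedFormatSketch {
  // stand-in for org.apache.druid.segment.column.ColumnCapabilities
  interface ColumnCapabilities {
    String getTypeName();
  }

  static class CapabilitiesBasedFormat {
    // the merge logic stays local and private, never leaked to callers
    private static final String DIMENSION_CAPABILITY_MERGE_LOGIC = "dimension-merge";

    private final ColumnCapabilities snapshot;

    private CapabilitiesBasedFormat(ColumnCapabilities snapshot) {
      this.snapshot = snapshot;
    }

    // in Druid this would internally call something like
    // ColumnCapabilitiesImpl.snapshot(capabilities, DIMENSION_CAPABILITY_MERGE_LOGIC)
    static CapabilitiesBasedFormat snapshot(ColumnCapabilities capabilities) {
      return new CapabilitiesBasedFormat(capabilities);
    }

    String describe() {
      return snapshot.getTypeName() + " (" + DIMENSION_CAPABILITY_MERGE_LOGIC + ")";
    }
  }
}
```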
```diff
 handler.makeMerger(
     indexSpec,
     segmentWriteOutMedium,
-    dimCapabilities.get(i),
+    dimFormats.get(i).toColumnCapabilities(),
     progress,
     closer
 )
```
The handler came from the dimFormat, right? Why would it need the capabilities from dimFormat? Can we not eliminate the argument entirely?
the snag here is that right now the string dimension merger needs the 'has multiple values' flag set on the capabilities to determine whether it makes a serializer for a single value or multi value string.
I didn't want to rework the classic columns and indexers to use the ColumnFormat stuff at this time, to minimize the number of changes, but the answer is yes: once we make a dedicated format for the String dimension schema, I would think it would capture whether the string was multi-valued or not and we can drop this argument.
I was thinking that because we are using CapabilitiesBasedFormat for all of the old ones, it would provide a nice clean place to just start ignoring this parameter. Though, punting on it for later is probably also fine, as even if we started ignoring it, making the actual interface change is less easy... Okay.
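For illustration, a hypothetical dedicated string format could carry the multi-value flag itself, which is roughly what the exchange above anticipates. The class and method names below are invented for this sketch and do not exist in Druid as written.

```java
// Hedged sketch: if the 'has multiple values' flag lived on a dedicated string
// ColumnFormat, makeMerger() would not need a separate capabilities argument.
class StringFormatSketch {
  static class StringDimensionFormat {
    private final boolean hasMultipleValues;

    StringDimensionFormat(boolean hasMultipleValues) {
      this.hasMultipleValues = hasMultipleValues;
    }

    // the merger asks the format directly which serializer to build
    String makeMergerKind() {
      return hasMultipleValues
             ? "multi-value-string-serializer"
             : "single-value-string-serializer";
    }
  }
}
```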
```diff
 ColumnDescriptor columnDesc = ColumnDescriptor
     .builder()
-    .setValueType(dimCapabilities.get(i).getType())
+    .setValueType(dimFormats.get(i).getLogicalType().getType())
     .addSerde(new NullColumnPartSerde(indexMergeResult.rowCount, indexSpec.getBitmapSerdeFactory()))
     .build();
```
Why do we need to store the type of the null column? Maybe we need a null type that indicates that it's always null and we can store it like that?
this is from that explicit null value columns change we did a while back that saves any dimension schema in the segment that didn't have any values
I agree this can probably be better/smarter
```java
public void deserializeColumn(
    @SuppressWarnings("unused") String columnName,
    ByteBuffer buffer,
    ColumnBuilder builder,
    ColumnConfig columnConfig
)
{
  deserializeColumn(buffer, builder, columnConfig);
}
```
I don't believe that we need to add this columnName parameter. So far, the serde code has generally kept the name disassociated from the actual storage of the column as the name of the column is actually unimportant for the ability to serialize and deserialize the bytes.
It looks like this is being done because the column name is being used as a prefix on the other names in the file smoosher, that makes sense, but at this point it's not a "column name" as much as a "unique prefix". Given that it is a prefix that we expect this column to use on all of its things, I think it makes sense to serialize out the unique prefix as part of the bytes of the column itself and then read it back in from there. Let it be coincidence that the unique prefix just so happens to be the same thing as the column name.
This is pretty similar to how it was already working in the "older" versions where it got the information by deserializing a metadata object...
removed from this interface since this wasn't needed for the v4 complex column, which has been reverted since it can get the filename from its embedded metadata.
I have left the parameter on the ColumnPartSerde.Deserializer interface for now because the new nested column part serde does need it: it doesn't store a separate metadata file embedded in it, and IndexIO.V9IndexLoader.deserializeColumn already has the interned column/filename passed in, so I took advantage of this and pushed it down (instead of writing the column name again inside of the nested part serde or its data file). This feels like an internal interface so it doesn't seem that disruptive to change, but I can do this other ways if you're really against it.
```diff
 {
   @Override
-  public void read(ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig)
+  public void read(String columnName, ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig)
```
Once we remove the argument from ComplexMetricSerde we don't need it here anymore either.
```diff
 @JsonSubTypes.Type(name = "doubleV2", value = DoubleNumericColumnPartSerdeV2.class),
-@JsonSubTypes.Type(name = "null", value = NullColumnPartSerde.class)
+@JsonSubTypes.Type(name = "null", value = NullColumnPartSerde.class),
+@JsonSubTypes.Type(name = "standard", value = StandardTypeColumnPartSerde.class)
```
What are the semantics of the standard column part serde? If it's standard, does it subsume all of the others?
```diff
-void read(ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig);
+void read(String columnName, ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig);
```
Let's not add the extra parameter here.
```diff
-Assert.assertEquals(168, key.getEffectiveSizeBytes());
-Assert.assertEquals(6, indexer.getCardinality());
+Assert.assertEquals(276, key.getEffectiveSizeBytes());
+Assert.assertEquals(5, indexer.getCardinality());
```
Is the difference in numbers because the test was validating v5, and now it's validating v4?
yeah, just rolling back to the old results https://github.com/apache/druid/pull/13803/files#diff-26f97e49d0f81a1797174646d89b826c3d175ddd7d4438e450fe03329cd030ffL97
```java
String name,
IndexSpec indexSpec,
SegmentWriteOutMedium segmentWriteOutMedium,
@SuppressWarnings("unused") ProgressIndicator progressIndicator,
```

Check notice (Code scanning / CodeQL): Useless parameter
```java
// this should use:
// columnHolder.getColumnFormat().getColumnSchema(dimension)
// someday...
```
But it cannot because...? (Please add the reason to the comment.)
```java
final SortedValueDictionary dimValues = mergable.getValueDictionary();
mergable.mergeFieldsInto(mergedFields);
```
It's unclear whether the dimValues dictionary would be mutated, and whether that's expected. I think it's expected that it is not mutated. In that case, it might be nice to push mergable.mergeFieldsInto() down below the if (!allNulls) { } block, just to make it clear that those values are evaluated and grabbed before any merge occurs.
```diff
  */
-package org.apache.druid.segment.column;
+package org.apache.druid.segment.nested;
```
Package nit: I wonder if it shouldn't be segment.column.nested? Probably doesn't really make that much of a difference, but seems more technically accurate.
we don't have a segment.column.nested right now, i almost left it in segment.column since that also seems appropriate
```java
 * Nested data column with optimized support for simple arrays. Not actually v5 in the segment since columns are now
 * serialized using {@link org.apache.druid.segment.serde.NestedCommonFormatColumnPartSerde} instead of the generic
 * complex type system.
 *
 * Not really stored in a segment as V5 since instead of V5 we migrated to {@link NestedCommonFormatColumn} which
 * specializes physical format based on the types of data encountered during processing, and so versions are now
 * {@link NestedCommonFormatColumnSerializer#V0} for all associated specializations.
```
I like the NestedCommon name that you picked for other stuff. It wouldn't bother me if you named this NestedCommon as well, fwiw. I'm also fine with the current naming, so more stream-of-consciousness thought than a request for change.
```java
/**
 * Deserializes a ByteBuffer and adds it to the ColumnBuilder. This method allows for the ComplexMetricSerde
 * to implement it's own versioning scheme to allow for changes of binary format in a forward-compatible manner.
 *
 * @param buffer       the buffer to deserialize
 * @param builder      ColumnBuilder to add the column to
 * @param columnConfig ColumnConfiguration used during deserialization
 */
```
Looks like the javadoc on this method got clobbered, probably by accident?
```java
// this shouldn't happen, but if it does, try to close to prevent a leak
try {
  col.close();
}
catch (IOException e) {
  throw new RuntimeException(e);
}
```
A bit of a nit, but if you add this text to the exception message, you get the benefit of the comment explaining that you don't expect it to occur and if the exception ever does get thrown, the message is there too for whoever sees it.
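A minimal sketch of that suggestion, folding the comment's explanation into the exception message so it also reaches whoever eventually sees the stack trace. The helper class and method names are invented for illustration.

```java
import java.io.Closeable;
import java.io.IOException;

// Hedged sketch: the "this shouldn't happen" context lives in the exception
// message itself, doubling as the explanatory comment.
class CloseOnErrorSketch {
  static void closeToPreventLeak(Closeable col) {
    try {
      col.close();
    }
    catch (IOException e) {
      throw new RuntimeException(
          "Failed to close column; this is not expected to happen, closing to prevent a leak",
          e
      );
    }
  }
}
```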
```java
boolean allNulls = dimValues == null || dimValues.allNull();
sortedLookup = dimValues;
if (!allNulls) {
  mergable.mergeFieldsInto(mergedFields);
```
Just double checking, I had been thinking this was correct outside of the if statement because, even if it's all nulls, the others might have never seen a null and so you want to merge it so that null is properly registered. Is that not a concern?
i realized that if it is all null this isn't doing anything useful anyway so i put it inside the if
Description
This PR reverts `NestedDataColumnIndexer`, `NestedDataColumnMerger`, and `NestedDataColumnSerializer` to their pre-#13803 behavior (v4) for backwards compatibility, in favor of splitting the new functionality into a new `"auto"` dimension schema type to more accurately reflect the new behavior that this PR adds.

This new 'auto' type indexer, merger, and associated family of serializers is the next logical iteration of the nested column work. Essentially, this is an automatic type column indexer that produces the most appropriate column for the given inputs, making either `STRING`, `ARRAY<STRING>`, `LONG`, `ARRAY<LONG>`, `DOUBLE`, `ARRAY<DOUBLE>`, or `COMPLEX<json>` columns, all sharing a common 'nested' format that allows segment merging to occur smoothly regardless of which physical format the column was written with.

To accompany this, a new `ColumnFormat` interface has been defined to separate physical storage format from logical type. `ColumnFormat` is now used instead of `ColumnCapabilities` to get column handlers for segment creation. This allows both the v4 legacy columns and the new common format nested columns to map to the `COMPLEX<json>` type.

This PR also fixes a bug in `RoaringBitmapSerdeFactory`: if anything actually ever wrote out an empty bitmap using `toBytes` and then later tried to read it, it would fail (the nerve!).

Release note
A new "auto" type column schema and indexer has been added to native ingestion as the next logical iteration of the nested column functionality. This automatic type column indexer produces the most appropriate column for the given inputs, producing either `STRING`, `ARRAY<STRING>`, `LONG`, `ARRAY<LONG>`, `DOUBLE`, `ARRAY<DOUBLE>`, or `COMPLEX<json>` columns, all sharing a common 'nested' format.

All columns produced by 'auto' have indexes to aid in fast filtering (unlike classic `LONG` and `DOUBLE` columns) and use cardinality-based thresholds to attempt to only utilize these indexes when it is likely to actually speed up the query (unlike classic `STRING` columns).

`COMPLEX<json>` columns produced by this 'auto' indexer store arrays of simple scalar types differently than their 'json' (v4) counterparts, storing them as `ARRAY` typed columns. This means that the `JSON_VALUE` function can now extract entire arrays, e.g. `JSON_VALUE(nested, '$.array' RETURNING BIGINT ARRAY)`. There is no change to how arrays of complex objects are stored at this time.

This improvement also adds completely new functionality to Druid: `ARRAY` typed columns, which unlike classic multi-value `STRING` columns behave with `ARRAY` semantics. These columns can currently only be created via the 'auto' type indexer when all values are arrays with the same type of elements.

Key changed/added classes in this PR
* `ColumnFormat`

This PR has: