Configurable bitmap index type for numeric fields in nested data columns#18722
Configurable bitmap index type for numeric fields in nested data columns#18722cecemei merged 22 commits intoapache:masterfrom
Conversation
| public void testIngestAndScanSegmentsRollup() throws Exception | ||
| @Parameters(method = "getNestedColumnFormatSpec") | ||
| @TestCaseName("{0}") | ||
| public void testIngestAndScanSegmentsWithSpec(String name, boolean auto, NestedCommonFormatColumnFormatSpec spec) |
Check notice
Code scanning / CodeQL
Useless parameter Note test
| @Parameters(method = "getNestedColumnFormatSpec") | ||
| @TestCaseName("{0}") | ||
| public void testIngestAndScanSegmentsTsv(String name, NestedCommonFormatColumnFormatSpec spec) throws Exception | ||
| public void testIngestAndScanSegmentsTsv(String name, boolean auto, NestedCommonFormatColumnFormatSpec spec) |
Check notice
Code scanning / CodeQL
Useless parameter Note test
| public void testIngestAndScanSegmentsAndFilter() throws Exception | ||
| @Parameters(method = "getNestedColumnFormatSpec") | ||
| @TestCaseName("{0}") | ||
| public void testIngestAndScanSegmentsAndFilter(String name, boolean auto, NestedCommonFormatColumnFormatSpec spec) |
Check notice
Code scanning / CodeQL
Useless parameter Note test
| @Parameters(method = "getNestedColumnFormatSpec") | ||
| @TestCaseName("{0}") | ||
| public void testIngestAndScanSegmentsAndRangeFilter( | ||
| String name, |
Check notice
Code scanning / CodeQL
Useless parameter Note test
| @Parameters(method = "getNestedColumnFormatSpec") | ||
| @TestCaseName("{0}") | ||
| public void testIngestAndScanSegmentsRealtimeAutoExplicit( | ||
| String name, |
Check notice
Code scanning / CodeQL
Useless parameter Note test
| @Parameters(method = "getNestedColumnFormatSpec") | ||
| @TestCaseName("{0}") | ||
| public void testIngestAndScanSegmentsAndFilterPartialPathArrayIndex( | ||
| String name, |
Check notice
Code scanning / CodeQL
Useless parameter Note test
| @Parameters(method = "getNestedColumnFormatSpec") | ||
| @TestCaseName("{0}") | ||
| public void testIngestAndScanSegmentsAndFilterPartialPath( | ||
| String name, |
Check notice
Code scanning / CodeQL
Useless parameter Note test
| @Parameters(method = "getNestedColumnFormatSpec") | ||
| @TestCaseName("{0}") | ||
| public void testIngestAndScanSegmentsNestedColumnNotNullFilter( | ||
| String name, |
Check notice
Code scanning / CodeQL
Useless parameter Note test
There was a problem hiding this comment.
Pull Request Overview
This PR introduces configurable bitmap index encoding strategies for numeric fields in nested data columns, allowing users to choose between full dictionary-based indexing and nulls-only indexing to optimize storage.
Key Changes:
- Added
BitmapIndexEncodingStrategyabstraction with two implementations:DictionaryId(full indexing) andNullsOnly(nulls-only indexing) - Updated
NestedCommonFormatColumnFormatSpecto includenumericFieldsBitmapIndexEncodingconfiguration - Refactored test utilities to use a new
SegmentBuilderpattern for cleaner test code
Reviewed Changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| BitmapIndexEncodingStrategy.java | New abstraction defining strategies for encoding bitmap indexes |
| NestedCommonFormatColumnFormatSpec.java | Added numericFieldsBitmapIndexEncoding field and updated serialization |
| GlobalDictionaryEncodedFieldColumnWriter.java | Refactored to use configurable bitmap encoding strategy |
| ScalarLongFieldColumnWriter.java | Set bitmap encoding strategy from column format spec |
| ScalarDoubleFieldColumnWriter.java | Set bitmap encoding strategy from column format spec |
| CompressedNestedDataComplexColumn.java | Updated to use format spec for bitmap encoding decisions |
| NestedDataColumnSupplier.java | Changed to use format spec instead of bitmap serde factory |
| NestedDataColumnSupplierV4.java | Changed to use format spec instead of bitmap serde factory |
| NestedDataColumnV3.java | Changed parameter type from BitmapSerdeFactory to format spec |
| NestedDataColumnV4.java | Changed parameter type from BitmapSerdeFactory to format spec |
| NestedDataColumnV5.java | Changed parameter type from BitmapSerdeFactory to format spec |
| NestedCommonFormatColumnPartSerde.java | Updated FormatSpec to include numericFieldsBitmapIndex |
| VariantFieldColumnWriter.java | Removed redundant writeColumnTo method |
| VariantArrayFieldColumnWriter.java | Removed redundant writeColumnTo method |
| ScalarStringFieldColumnWriter.java | Removed redundant writeColumnTo method |
| NestedDataTestUtils.java | Refactored with new SegmentBuilder pattern for test data creation |
| NestedDataScanQueryTest.java | Updated tests to use SegmentBuilder and test new bitmap strategies |
| NestedDataColumnSchemaTest.java | Updated test to include bitmap encoding strategy |
| NestedDataColumnSupplierTest.java | Fixed parameter in test (bitmapSerdeFactory → columnFormatSpec) |
| NestedCommonFormatColumnFormatSpecTest.java | Added test coverage for numericFieldsBitmapIndexEncoding |
| BitmapIndexEncodingStrategyTest.java | New test file for bitmap encoding strategy serialization |
| BuiltInTypesModuleTest.java | Updated test to verify numericFieldsBitmapIndexEncoding configuration |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| private IndexSpec indexSpec = IndexSpec.getDefault(); | ||
|
|
||
| /** | ||
| * Builder for an {@link IncrementalIndexSegment} or a list of{@link QueryableIndexSegment}, with some defaults: |
There was a problem hiding this comment.
Missing space between 'of' and '{@link QueryableIndexSegment}'.
| * Builder for an {@link IncrementalIndexSegment} or a list of{@link QueryableIndexSegment}, with some defaults: | |
| * Builder for an {@link IncrementalIndexSegment} or a list of {@link QueryableIndexSegment}, with some defaults: |
| .build(); | ||
| Query<ScanResultValue> scanQuery = queryBuilder() | ||
| .columns("timestamp", "str", "double", "bool", "variant", | ||
| "variantNumeric", "variantEmptyObj", "variantEmtpyArray", "variantWithArrays" |
There was a problem hiding this comment.
Corrected spelling of 'variantEmtpyArray' to 'variantEmptyArray'.
| "variantNumeric", "variantEmptyObj", "variantEmtpyArray", "variantWithArrays" | |
| "variantNumeric", "variantEmptyObj", "variantEmptyArray", "variantWithArrays" |
| .build(); | ||
| Query<ScanResultValue> scanQuery = queryBuilder() | ||
| .columns("timestamp", "str", "double", "bool", "variant", | ||
| "variantNumeric", "variantEmptyObj", "variantEmtpyArray", "variantWithArrays" |
There was a problem hiding this comment.
Corrected spelling of 'variantEmtpyArray' to 'variantEmptyArray'.
| "variantNumeric", "variantEmptyObj", "variantEmtpyArray", "variantWithArrays" | |
| "variantNumeric", "variantEmptyObj", "variantEmptyArray", "variantWithArrays" |
…dCommonFormatColumnFormatSpec.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
BitmapIndexEncodingStrategy to control the bitmap encoding in a nested column
BitmapIndexEncodingStrategy to control the bitmap encoding in a nested column| @JsonCreator | ||
| public NestedCommonFormatColumnFormatSpec( | ||
| @JsonProperty("objectFieldsDictionaryEncoding") @Nullable StringEncodingStrategy objectFieldsDictionaryEncoding, | ||
| @JsonProperty("numericFieldsBitmapIndexEncoding") @Nullable BitmapIndexEncodingStrategy numericFieldsBitmapIndexEncoding, |
There was a problem hiding this comment.
this should be separately controllable for long and double fields, also I don't think 'encoding' is quite the correct thing to call this as the type of bitmap encoding itself is controlled by IndexSpec.
How about something like longFieldIndexType and doubleFieldIndexType since these are controlling the type of indexes we build for the field.
There was a problem hiding this comment.
i was thinking more granular controls by field names, with numeric as default. kinda feel like nothing in particular that double and long might be different? adding longFieldIndexType and doubleFieldIndexType is not too much work either, can do that if you have strong opinions on this.
rn i renamed to numericFieldsBitmapIndexType and BitmapIndexType.
There was a problem hiding this comment.
i was wanting long and double to be separate so we could support use cases where json has fields with long values that are used as dimensions but doubles as measures, before we implement per field customization. I think I was imagining per field customization would allow partial declaration, so that it could fall back to the per type default if an explicit configuration was not specified for a given field, so it would still be nice to be able to control them separately
There was a problem hiding this comment.
added the longFieldBitmapIndex and doubleFieldBitmapIndex config, looks like 134 bytes (for both config) per field increase in the overhead.
| } | ||
| } | ||
|
|
||
| public static class NullsOnly extends BitmapIndexEncodingStrategy |
There was a problem hiding this comment.
its kind of weird that the null index needs a GenericIndexedWriter to write only a single bitmap... I think these should just write the bitmap like other implementations of stuff that gets wired up to NullValueIndex, which just use a ByteBufferWriter to write the blob (see LongColumnSerializerV2, DoubleColumnSerializerV2, NestedDataColumnSerializer, etc)
I see why it is like this, so you could have some shared code, but I think as we add other types of indexes there will be less and less shared code and having an abstract base type isn't really the correct abstraction.
There was a problem hiding this comment.
are you thinking that the BitmapIndexType should handle deserialization? i thought about that but was not sure where the meta for other bitmap index type would be stored in, so i didint explore further.
There was a problem hiding this comment.
are you thinking that the BitmapIndexType should handle deserialization?
yes, but not in this PR, i have some changes in mind to do as a follow-up to overhaul these interfaces a bit.
Since the indexes are stored at the end of the buffer, it wouldn't be that much trouble to switch to using ByteBufferWriter to just write the single bitmap and be consistent with other numeric columns and save some space compared to the overhead of GenericIndexed for fields that only have null value indexes. For now i think could just swap the index handling logic in CompressedNestedDataComplexColumn.readNestedFieldColumn to handle the index type at the part where we currently do GenericIndexed.read, and use bitmapSerdeFactory.getObjectStrategy().fromByteBufferWithSize if only has null index and to use the NullValueIndexSupplier as the index supplier.
There was a problem hiding this comment.
updated to use ByteBufferWriter for null index, also realized that since the writer has state we cant really use BitmapIndexType directly so added a getWriter method
| System.out.println(noneObjectStorageFormatSize); | ||
| System.out.println(defaultFormatSize); | ||
| Assertions.assertTrue( | ||
| Integer.parseInt(noneObjectStorageFormatSize) <= Integer.parseInt(defaultFormatSize) * 0.8, |
Check notice
Code scanning / CodeQL
Missing catch of NumberFormatException Note test
| System.out.println(noneObjectStorageFormatSize); | ||
| System.out.println(defaultFormatSize); | ||
| Assertions.assertTrue( | ||
| Integer.parseInt(noneObjectStorageFormatSize) <= Integer.parseInt(defaultFormatSize) * 0.8, |
Check notice
Code scanning / CodeQL
Missing catch of NumberFormatException Note test
Description
BitmapIndexTypeabstraction with two implementations: DictionaryEncodedValueIndex (full indexing) and NullValueIndex (nulls-only indexing). It can be configured viaLongFieldBitmapIndexEncodingandDoubleFieldBitmapIndexEncodinginNestedCommonFormatColumnFormatSpec.NestedDataScanQueryTestto use parameterized testing for better coverage and maintainability, added aResourceFileSegmentBuilderclass to build segment for cleaner test code.GlobalDictionaryEncodedFieldColumnWriterto makegetSerializedColumnSizeandwriteColumnTosize match in the same class for consistency.Key changed/added classes in this PR
BitmapIndexTypeBitmapIndexType.DictionaryEncodedValueIndexBitmapIndexType.NullValueIndexNestedCommonFormatColumnFormatSpecNestedDataScanQueryTestThis PR has: