
Configurable bitmap index type for numeric fields in nested data columns #18722

Merged
cecemei merged 22 commits into apache:master from cecemei:bitmap
Dec 12, 2025

Conversation

@cecemei
Contributor

@cecemei cecemei commented Nov 6, 2025

Description

  • Main change: users can now choose between full dictionary-based indexing and nulls-only indexing for long and double fields in nested columns. This is done via a new BitmapIndexType abstraction with two implementations: DictionaryEncodedValueIndex (full indexing) and NullValueIndex (nulls-only indexing). The index type is configured via LongFieldBitmapIndexEncoding and DoubleFieldBitmapIndexEncoding in NestedCommonFormatColumnFormatSpec.
  • Refactored NestedDataScanQueryTest to use parameterized testing for better coverage and maintainability, and added a ResourceFileSegmentBuilder class that builds segments, for cleaner test code.
  • Minor update to GlobalDictionaryEncodedFieldColumnWriter: getSerializedColumnSize and writeColumnTo now live in the same class, so the reported size and the written size stay consistent.
Key changed/added classes in this PR
  • BitmapIndexType
  • BitmapIndexType.DictionaryEncodedValueIndex
  • BitmapIndexType.NullValueIndex
  • NestedCommonFormatColumnFormatSpec
  • NestedDataScanQueryTest
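
A minimal sketch of the difference between the two index types described above. This is not the Druid implementation: the `IndexType` enum, the `buildIndex` helper, and the use of `java.util.BitSet` in place of Druid's bitmap classes are all assumptions made for illustration. It only shows that full indexing produces one bitmap per distinct value, while nulls-only indexing produces a single bitmap of null rows.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

class BitmapIndexTypeSketch {
    /** Hypothetical, simplified stand-in for the PR's BitmapIndexType. */
    enum IndexType {
        DICTIONARY_ENCODED_VALUE, // full indexing: one bitmap per distinct value
        NULL_VALUE                // nulls-only indexing: a single bitmap of null rows
    }

    /**
     * Builds a value -> row bitmap map for a column of nullable longs.
     * The key "null" marks the bitmap of null rows.
     */
    static Map<String, BitSet> buildIndex(Long[] rows, IndexType type) {
        Map<String, BitSet> bitmaps = new LinkedHashMap<>();
        for (int row = 0; row < rows.length; row++) {
            Long value = rows[row];
            if (value == null) {
                // Both index types track null rows.
                bitmaps.computeIfAbsent("null", k -> new BitSet()).set(row);
            } else if (type == IndexType.DICTIONARY_ENCODED_VALUE) {
                // Only full indexing pays for a bitmap per non-null value.
                bitmaps.computeIfAbsent(String.valueOf(value), k -> new BitSet()).set(row);
            }
        }
        return bitmaps;
    }

    public static void main(String[] args) {
        Long[] rows = {10L, null, 20L, 10L, null};
        System.out.println("full:       " + buildIndex(rows, IndexType.DICTIONARY_ENCODED_VALUE));
        System.out.println("nulls only: " + buildIndex(rows, IndexType.NULL_VALUE));
    }
}
```

The storage saving comes from the second case: a high-cardinality numeric field used only as a measure would otherwise carry one bitmap per distinct value.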

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

public void testIngestAndScanSegmentsRollup() throws Exception
@Parameters(method = "getNestedColumnFormatSpec")
@TestCaseName("{0}")
public void testIngestAndScanSegmentsWithSpec(String name, boolean auto, NestedCommonFormatColumnFormatSpec spec)

Check notice

Code scanning / CodeQL

Useless parameter Note test

The parameter 'name' is never used.
Comment thread processing/src/test/java/org/apache/druid/query/scan/NestedDataScanQueryTest.java Dismissed
@Parameters(method = "getNestedColumnFormatSpec")
@TestCaseName("{0}")
public void testIngestAndScanSegmentsTsv(String name, NestedCommonFormatColumnFormatSpec spec) throws Exception
public void testIngestAndScanSegmentsTsv(String name, boolean auto, NestedCommonFormatColumnFormatSpec spec)

Check notice

Code scanning / CodeQL

Useless parameter Note test

The parameter 'name' is never used.
public void testIngestAndScanSegmentsAndFilter() throws Exception
@Parameters(method = "getNestedColumnFormatSpec")
@TestCaseName("{0}")
public void testIngestAndScanSegmentsAndFilter(String name, boolean auto, NestedCommonFormatColumnFormatSpec spec)

Check notice

Code scanning / CodeQL

Useless parameter Note test

The parameter 'name' is never used.
@Parameters(method = "getNestedColumnFormatSpec")
@TestCaseName("{0}")
public void testIngestAndScanSegmentsAndRangeFilter(
String name,

Check notice

Code scanning / CodeQL

Useless parameter Note test

The parameter 'name' is never used.
@Parameters(method = "getNestedColumnFormatSpec")
@TestCaseName("{0}")
public void testIngestAndScanSegmentsRealtimeAutoExplicit(
String name,

Check notice

Code scanning / CodeQL

Useless parameter Note test

The parameter 'name' is never used.
@Parameters(method = "getNestedColumnFormatSpec")
@TestCaseName("{0}")
public void testIngestAndScanSegmentsAndFilterPartialPathArrayIndex(
String name,

Check notice

Code scanning / CodeQL

Useless parameter Note test

The parameter 'name' is never used.
@Parameters(method = "getNestedColumnFormatSpec")
@TestCaseName("{0}")
public void testIngestAndScanSegmentsAndFilterPartialPath(
String name,

Check notice

Code scanning / CodeQL

Useless parameter Note test

The parameter 'name' is never used.
@Parameters(method = "getNestedColumnFormatSpec")
@TestCaseName("{0}")
public void testIngestAndScanSegmentsNestedColumnNotNullFilter(
String name,

Check notice

Code scanning / CodeQL

Useless parameter Note test

The parameter 'name' is never used.
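
The flagged `name` parameter is never read in the test bodies, but it is presumably there to feed `@TestCaseName("{0}")`, where (assuming the JUnitParams convention) `{0}` refers to the first parameter of each row. The sketch below imitates that substitution with `java.text.MessageFormat` from the standard library; it does not use JUnitParams itself, and the row contents are hypothetical.

```java
import java.text.MessageFormat;

class TestCaseNameSketch {
    /** Hypothetical parameter provider: each row is {name, auto, spec}. */
    static Object[][] getNestedColumnFormatSpec() {
        return new Object[][]{
            {"default", true, "spec-with-defaults"},
            {"nullsOnlyLong", false, "spec-with-null-index-for-longs"}
        };
    }

    /** Resolves a "{0}"-style template against one parameter row. */
    static String displayName(String template, Object[] params) {
        return MessageFormat.format(template, params);
    }

    public static void main(String[] args) {
        for (Object[] row : getNestedColumnFormatSpec()) {
            // Only the first element is used for the display name; the test
            // body itself would consume the remaining elements.
            System.out.println(displayName("{0}", row));
        }
    }
}
```

Under that assumption the CodeQL notice is a false positive in spirit: the parameter is consumed by the naming template rather than by the method body.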
@cecemei cecemei requested a review from Copilot November 10, 2025 14:23
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces configurable bitmap index encoding strategies for numeric fields in nested data columns, allowing users to choose between full dictionary-based indexing and nulls-only indexing to optimize storage.

Key Changes:

  • Added BitmapIndexEncodingStrategy abstraction with two implementations: DictionaryId (full indexing) and NullsOnly (nulls-only indexing)
  • Updated NestedCommonFormatColumnFormatSpec to include numericFieldsBitmapIndexEncoding configuration
  • Refactored test utilities to use a new SegmentBuilder pattern for cleaner test code

Reviewed Changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| BitmapIndexEncodingStrategy.java | New abstraction defining strategies for encoding bitmap indexes |
| NestedCommonFormatColumnFormatSpec.java | Added numericFieldsBitmapIndexEncoding field and updated serialization |
| GlobalDictionaryEncodedFieldColumnWriter.java | Refactored to use configurable bitmap encoding strategy |
| ScalarLongFieldColumnWriter.java | Set bitmap encoding strategy from column format spec |
| ScalarDoubleFieldColumnWriter.java | Set bitmap encoding strategy from column format spec |
| CompressedNestedDataComplexColumn.java | Updated to use format spec for bitmap encoding decisions |
| NestedDataColumnSupplier.java | Changed to use format spec instead of bitmap serde factory |
| NestedDataColumnSupplierV4.java | Changed to use format spec instead of bitmap serde factory |
| NestedDataColumnV3.java | Changed parameter type from BitmapSerdeFactory to format spec |
| NestedDataColumnV4.java | Changed parameter type from BitmapSerdeFactory to format spec |
| NestedDataColumnV5.java | Changed parameter type from BitmapSerdeFactory to format spec |
| NestedCommonFormatColumnPartSerde.java | Updated FormatSpec to include numericFieldsBitmapIndex |
| VariantFieldColumnWriter.java | Removed redundant writeColumnTo method |
| VariantArrayFieldColumnWriter.java | Removed redundant writeColumnTo method |
| ScalarStringFieldColumnWriter.java | Removed redundant writeColumnTo method |
| NestedDataTestUtils.java | Refactored with new SegmentBuilder pattern for test data creation |
| NestedDataScanQueryTest.java | Updated tests to use SegmentBuilder and test new bitmap strategies |
| NestedDataColumnSchemaTest.java | Updated test to include bitmap encoding strategy |
| NestedDataColumnSupplierTest.java | Fixed parameter in test (bitmapSerdeFactory → columnFormatSpec) |
| NestedCommonFormatColumnFormatSpecTest.java | Added test coverage for numericFieldsBitmapIndexEncoding |
| BitmapIndexEncodingStrategyTest.java | New test file for bitmap encoding strategy serialization |
| BuiltInTypesModuleTest.java | Updated test to verify numericFieldsBitmapIndexEncoding configuration |


private IndexSpec indexSpec = IndexSpec.getDefault();

/**
* Builder for an {@link IncrementalIndexSegment} or a list of{@link QueryableIndexSegment}, with some defaults:
Copilot AI Nov 10, 2025

Missing space between 'of' and '{@link QueryableIndexSegment}'.

Suggested change
* Builder for an {@link IncrementalIndexSegment} or a list of{@link QueryableIndexSegment}, with some defaults:
* Builder for an {@link IncrementalIndexSegment} or a list of {@link QueryableIndexSegment}, with some defaults:

.build();
Query<ScanResultValue> scanQuery = queryBuilder()
.columns("timestamp", "str", "double", "bool", "variant",
"variantNumeric", "variantEmptyObj", "variantEmtpyArray", "variantWithArrays"
Copilot AI Nov 10, 2025

Corrected spelling of 'variantEmtpyArray' to 'variantEmptyArray'.

Suggested change
"variantNumeric", "variantEmptyObj", "variantEmtpyArray", "variantWithArrays"
"variantNumeric", "variantEmptyObj", "variantEmptyArray", "variantWithArrays"

…dCommonFormatColumnFormatSpec.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@cecemei cecemei marked this pull request as ready for review November 10, 2025 14:32
@cecemei cecemei changed the title from "index-strategy" to "Add a new BitmapIndexEncodingStrategy to control the bitmap encoding in a nested column" Nov 24, 2025
@cecemei cecemei changed the title from "Add a new BitmapIndexEncodingStrategy to control the bitmap encoding in a nested column" to "Configurable bitmap index encoding strategies for numeric fields in nested data columns" Nov 24, 2025
@cecemei cecemei requested a review from clintropolis November 25, 2025 20:01
Comment thread processing/src/main/java/org/apache/druid/segment/nested/NestedDataColumnV5.java Outdated
@JsonCreator
public NestedCommonFormatColumnFormatSpec(
@JsonProperty("objectFieldsDictionaryEncoding") @Nullable StringEncodingStrategy objectFieldsDictionaryEncoding,
@JsonProperty("numericFieldsBitmapIndexEncoding") @Nullable BitmapIndexEncodingStrategy numericFieldsBitmapIndexEncoding,
Member

this should be separately controllable for long and double fields, also I don't think 'encoding' is quite the correct thing to call this as the type of bitmap encoding itself is controlled by IndexSpec.

How about something like longFieldIndexType and doubleFieldIndexType since these are controlling the type of indexes we build for the field.

Contributor Author

I was thinking of more granular controls by field name, with numeric as the default. I kinda feel there's nothing in particular that would make double and long different? Adding longFieldIndexType and doubleFieldIndexType is not too much work either, so I can do that if you have strong opinions on this.

For now I renamed these to numericFieldsBitmapIndexType and BitmapIndexType.

Member

i was wanting long and double to be separate so we could support use cases where json has fields with long values that are used as dimensions but doubles as measures, before we implement per field customization. I think I was imagining per field customization would allow partial declaration, so that it could fall back to the per type default if an explicit configuration was not specified for a given field, so it would still be nice to be able to control them separately

Contributor Author

added the longFieldBitmapIndex and doubleFieldBitmapIndex configs; looks like this adds 134 bytes of overhead per field (for both configs).
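
The fallback discussed in this thread (an explicit per-field setting wins, otherwise the per-type default applies) could be sketched as follows. Per-field customization is not part of this PR, so the `perFieldOverrides` map, the `resolve` method, and the field paths are hypothetical; only the idea of separate long and double defaults comes from the conversation.

```java
import java.util.HashMap;
import java.util.Map;

class IndexTypeResolutionSketch {
    enum IndexType { DICTIONARY_ENCODED_VALUE, NULL_VALUE }
    enum FieldType { LONG, DOUBLE }

    private final IndexType longDefault;
    private final IndexType doubleDefault;
    // Hypothetical partial declaration: only some field paths are listed.
    private final Map<String, IndexType> perFieldOverrides;

    IndexTypeResolutionSketch(IndexType longDefault, IndexType doubleDefault,
                              Map<String, IndexType> perFieldOverrides) {
        this.longDefault = longDefault;
        this.doubleDefault = doubleDefault;
        this.perFieldOverrides = new HashMap<>(perFieldOverrides);
    }

    /** An explicit per-field setting wins; otherwise fall back to the per-type default. */
    IndexType resolve(String path, FieldType type) {
        IndexType override = perFieldOverrides.get(path);
        if (override != null) {
            return override;
        }
        return type == FieldType.LONG ? longDefault : doubleDefault;
    }

    public static void main(String[] args) {
        // Longs act as dimensions (full index), doubles as measures (nulls only),
        // with one double field explicitly opted back in to full indexing.
        Map<String, IndexType> overrides = new HashMap<>();
        overrides.put("$.price", IndexType.DICTIONARY_ENCODED_VALUE);
        IndexTypeResolutionSketch spec = new IndexTypeResolutionSketch(
            IndexType.DICTIONARY_ENCODED_VALUE, IndexType.NULL_VALUE, overrides);

        System.out.println(spec.resolve("$.userId", FieldType.LONG));
        System.out.println(spec.resolve("$.latency", FieldType.DOUBLE));
        System.out.println(spec.resolve("$.price", FieldType.DOUBLE));
    }
}
```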

Comment thread processing/src/main/java/org/apache/druid/segment/column/BitmapIndexType.java Outdated
Comment thread processing/src/test/java/org/apache/druid/query/NestedDataTestUtils.java Outdated
}
}

public static class NullsOnly extends BitmapIndexEncodingStrategy
Member

it's kind of weird that the null index needs a GenericIndexedWriter to write only a single bitmap... I think these should just write the bitmap like other implementations of stuff that gets wired up to NullValueIndex, which just use a ByteBufferWriter to write the blob (see LongColumnSerializerV2, DoubleColumnSerializerV2, NestedDataColumnSerializer, etc)

I see why it is like this, so you could have some shared code, but I think as we add other types of indexes there will be less and less shared code and having an abstract base type isn't really the correct abstraction.

Contributor Author

are you thinking that the BitmapIndexType should handle deserialization? I thought about that but was not sure where the metadata for other bitmap index types would be stored, so I didn't explore further.

Member

@clintropolis clintropolis Dec 10, 2025

are you thinking that the BitmapIndexType should handle deserialization?

yes, but not in this PR, i have some changes in mind to do as a follow-up to overhaul these interfaces a bit.

Since the indexes are stored at the end of the buffer, it wouldn't be much trouble to switch to ByteBufferWriter to write just the single bitmap. That would be consistent with other numeric columns and would save space compared to the overhead of GenericIndexed for fields that only have null value indexes. For now I think we could just swap the index handling logic in CompressedNestedDataComplexColumn.readNestedFieldColumn to handle the index type at the point where we currently do GenericIndexed.read: use bitmapSerdeFactory.getObjectStrategy().fromByteBufferWithSize if the field only has a null index, with NullValueIndexSupplier as the index supplier.

Contributor Author

updated to use ByteBufferWriter for the null index. Also realized that since the writer has state we can't really use BitmapIndexType directly, so I added a getWriter method.
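
A rough sketch of the two points in this thread: a stateless index type that hands out a fresh, stateful writer per field, and a null index stored as a single size-prefixed blob instead of an indexed list of bitmaps. Everything here is an assumption for illustration: `java.util.BitSet` stands in for Druid's bitmaps, the `[int length][bytes]` layout is not claimed to match Druid's ByteBufferWriter format, and `readWithSize` only imitates the role of `fromByteBufferWithSize`.

```java
import java.nio.ByteBuffer;
import java.util.BitSet;

class NullIndexWriterSketch {
    /** Stateless configuration value; safe to share across fields. */
    enum IndexType {
        NULL_VALUE;

        /** Each field gets its own writer because the writer accumulates rows. */
        NullBitmapWriter getWriter() {
            return new NullBitmapWriter();
        }
    }

    /** Stateful writer that tracks which rows were null. */
    static final class NullBitmapWriter {
        private final BitSet nullRows = new BitSet();
        private int rowCount = 0;

        void add(Object value) {
            if (value == null) {
                nullRows.set(rowCount);
            }
            rowCount++;
        }

        /** Serializes the single bitmap as [int length][bytes]. */
        ByteBuffer write() {
            byte[] bytes = nullRows.toByteArray();
            ByteBuffer out = ByteBuffer.allocate(Integer.BYTES + bytes.length);
            out.putInt(bytes.length).put(bytes);
            out.flip();
            return out;
        }
    }

    /** Reads back one size-prefixed bitmap from the buffer's current position. */
    static BitSet readWithSize(ByteBuffer in) {
        int size = in.getInt();
        byte[] bytes = new byte[size];
        in.get(bytes);
        return BitSet.valueOf(bytes);
    }

    public static void main(String[] args) {
        NullBitmapWriter writer = IndexType.NULL_VALUE.getWriter();
        writer.add(1L);
        writer.add(null);
        writer.add(2L);
        writer.add(null);
        System.out.println(readWithSize(writer.write()));
    }
}
```

Keeping the writer separate from the type means two fields configured with the same index type never share row state.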

Comment thread processing/src/main/java/org/apache/druid/segment/column/BitmapIndexType.java Outdated
Comment thread processing/src/test/java/org/apache/druid/guice/BuiltInTypesModuleTest.java Outdated
System.out.println(noneObjectStorageFormatSize);
System.out.println(defaultFormatSize);
Assertions.assertTrue(
Integer.parseInt(noneObjectStorageFormatSize) <= Integer.parseInt(defaultFormatSize) * 0.8,

Check notice

Code scanning / CodeQL

Missing catch of NumberFormatException Note test

Potential uncaught 'java.lang.NumberFormatException'.
Member

@clintropolis clintropolis left a comment

lgtm, thanks 🤘

@cecemei cecemei changed the title from "Configurable bitmap index encoding strategies for numeric fields in nested data columns" to "Configurable bitmap index type for numeric fields in nested data columns" Dec 12, 2025
@cecemei cecemei merged commit ccade1a into apache:master Dec 12, 2025
99 of 100 checks passed
@kgyrtkirk kgyrtkirk added this to the 36.0.0 milestone Jan 19, 2026

5 participants