add NumericRangeIndex interface and BoundFilter support #12830
clintropolis merged 4 commits into apache:master
changes:
* NumericRangeIndex interface, like LexicographicalRangeIndex but for numbers
* BoundFilter now uses NumericRangeIndex if comparator is numeric and there is no extractionFn
* NestedFieldLiteralColumnIndexSupplier.java now supports supplying NumericRangeIndex for single typed numeric nested literal columns
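To make the idea of a numeric range index concrete, here is a minimal sketch, assuming a sorted dictionary of distinct values with one bitmap of row numbers per value. The names SketchNumericRangeIndex and SketchSortedValueIndex are hypothetical, java.util.BitSet stands in for real bitmap indexes, and none of this is the actual interface or implementation from this PR.

```java
import java.util.BitSet;

// Hypothetical, simplified stand-in for a numeric range index (not the real Druid interface).
interface SketchNumericRangeIndex {
    // Rows whose value lies within the bounds; a null bound means that side is unbounded.
    BitSet forRange(Double lower, boolean lowerStrict, Double upper, boolean upperStrict);
}

// Assumed layout: distinct values sorted ascending, each paired with a bitmap of matching rows.
final class SketchSortedValueIndex implements SketchNumericRangeIndex {
    private final double[] sortedValues; // distinct values, ascending
    private final BitSet[] bitmaps;      // bitmaps[i] holds the rows containing sortedValues[i]

    SketchSortedValueIndex(double[] sortedValues, BitSet[] bitmaps) {
        this.sortedValues = sortedValues;
        this.bitmaps = bitmaps;
    }

    @Override
    public BitSet forRange(Double lower, boolean lowerStrict, Double upper, boolean upperStrict) {
        // Values before 'start' fall below the lower bound; values from 'end' on exceed the upper bound.
        int start = lower == null ? 0 : countValues(lower, lowerStrict);
        int end = upper == null ? sortedValues.length : countValues(upper, !upperStrict);
        BitSet result = new BitSet();
        for (int i = start; i < end; i++) {
            result.or(bitmaps[i]); // union the bitmaps of every dictionary value inside the range
        }
        return result;
    }

    // Binary search: number of values v with v < bound, or v <= bound when 'inclusive' is true.
    private int countValues(double bound, boolean inclusive) {
        int lo = 0;
        int hi = sortedValues.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            boolean counted = inclusive ? sortedValues[mid] <= bound : sortedValues[mid] < bound;
            if (counted) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

With values 1.0, 2.5, 4.0, 7.0 stored in rows {0, 3}, {1}, {2, 5}, {4} respectively, forRange(2.5, false, 7.0, true) unions the bitmaps for 2.5 and 4.0 and yields rows {1, 2, 5}. The range is resolved with two binary searches over the dictionary rather than by evaluating a predicate against every dictionary value.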
    boundDimFilter.getUpper(),
    boundDimFilter.isUpperStrict()
);
// preserve sad backwards compatible behavior where bound filter matches 'null' if the lower bound is not set
I think this can be refactored outside of the if, as it is common to both the String and the Numeric cases.
They just look really similar right now, but the range indexes don't currently share a common interface, so it isn't super easy.
  return Filters.makeNullIndex(doesMatchNull(), selector);
}
final LexicographicalRangeIndex rangeIndex = indexSupplier.as(LexicographicalRangeIndex.class);
if (rangeIndex == null) {
was this a bug? (returning null)
It was a bug with single typed numeric nested literal columns, and potentially with future columns that only have some of these index types, but not with STRING typed columns, which have both lexicographical range indexes and predicate indexes.
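As a rough illustration of the general pattern (not the actual BoundFilter code), the sketch below assumes a supplier whose as(...) method returns null when the column lacks the requested index type, and a caller that keeps trying other index types before giving up. All type and method names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical index flavors; describe() just identifies which one was picked.
interface SketchRangeIndex { String describe(); }
interface SketchPredicateIndex { String describe(); }

// Hypothetical supplier: as(...) returns null when the column does not provide that index type.
final class SketchIndexSupplier {
    private final Map<Class<?>, Object> indexes = new HashMap<>();

    <T> SketchIndexSupplier with(Class<T> clazz, T index) {
        indexes.put(clazz, index);
        return this;
    }

    <T> T as(Class<T> clazz) {
        return clazz.cast(indexes.get(clazz)); // cast(null) is null, so missing types yield null
    }
}

final class SketchIndexChooser {
    // Returns a description of the index to use, or null only if no usable index exists at all.
    static String choose(SketchIndexSupplier supplier) {
        SketchRangeIndex range = supplier.as(SketchRangeIndex.class);
        if (range != null) {
            return range.describe();
        }
        // A missing range index is not the end of the road: try the next index type.
        SketchPredicateIndex predicate = supplier.as(SketchPredicateIndex.class);
        if (predicate != null) {
            return predicate.describe();
        }
        return null;
    }
}
```

The point of the sketch is only that a null from one as(...) call should not be treated as "no index available" while other index types remain untried.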
I've ended up reworking how the ranges are computed to be much more sane and efficient, which has sped up the process quite a lot. I've updated the PR description with the new results.
Description
This PR adds a NumericRangeIndex interface, which is like LexicographicalRangeIndex but for numbers. BoundFilter now uses NumericRangeIndex if the comparator is numeric and there is no extractionFn defined, allowing number columns with indexes to take the same shortcut we use for string columns.
I have wired this up to COMPLEX<json> columns since the nested numeric columns have bitmap indexes, which gives a pretty decent performance boost compared to using the predicate based index for numeric ranges. Note that regular numeric columns do not have indexes yet, which I plan to add in a future PR.
no index (regular numeric columns):
predicate index (equivalent nested numeric columns with identical data):
range index (equivalent nested numeric columns with identical data):
Revisiting the initial benchmarks I ran for nested columns, #12753 (comment), query '15' was significantly slower than '14' which had no indexes, and this is still true though to a lesser degree than before. Digging a bit deeper, the reason for this is that 2.6 million values match which means merging that many bitmaps. Query 27 has ~1500 matches, while query 29 has 270k matching values.
This indicates that there is some threshold of "too many bitmaps" at which we should skip using indexes and fall back to a full scan and value matchers. I plan to introduce a mechanism for this in a future PR.
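Purely as an illustration of what such a cutoff could look like, the sketch below compares the number of bitmaps a range would need to merge against a limit and picks a strategy. The class name, the enum, and the limit value are all hypothetical; the actual mechanism and threshold are left to the future PR.

```java
// Hypothetical planner: decides between merging bitmaps and scanning with value matchers.
final class SketchScanPlanner {
    enum Strategy { USE_INDEX, FULL_SCAN }

    private final long maxBitmapsToMerge; // arbitrary, assumed cutoff; not a measured value

    SketchScanPlanner(long maxBitmapsToMerge) {
        this.maxBitmapsToMerge = maxBitmapsToMerge;
    }

    // bitmapsInRange: how many dictionary values (and so bitmaps) fall inside the filter's range.
    Strategy plan(long bitmapsInRange) {
        return bitmapsInRange <= maxBitmapsToMerge ? Strategy.USE_INDEX : Strategy.FULL_SCAN;
    }
}
```

For example, with an assumed limit of 1,000,000, a range matching roughly 1500 values would use the index, while one matching 2.6 million values would fall back to a full scan. Where the real crossover lies would need to be determined by benchmarking.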
I also added some direct testing for NestedFieldLiteralColumnIndexSupplier, which was previously only indirectly tested via queries. This was a bit tedious, but also fixed a couple of minor bugs.
This PR has: