Skip to content

add NumericRangeIndex interface and BoundFilter support#12830

Merged
clintropolis merged 4 commits intoapache:masterfrom
clintropolis:json-numeric-index
Jul 30, 2022
Merged

add NumericRangeIndex interface and BoundFilter support#12830
clintropolis merged 4 commits intoapache:masterfrom
clintropolis:json-numeric-index

Conversation

@clintropolis
Copy link
Copy Markdown
Member

@clintropolis clintropolis commented Jul 28, 2022

Description

This PR adds a NumericRangeIndex interface, which is like LexicographicalRangeIndex but for numbers. BoundFilter now uses NumericRangeIndex if comparator is numeric and there is no extractionFn defined, allowing number columns with indexes to take the same shortcut we use for string columns.

I have wired this up to COMPLEX<json> columns since the nested numeric columns have bitmap indexes, which has a pretty decent performance boost compared to using the predicate based index for numeric ranges. Note that regular numeric columns do not yet have indexes yet, which I plan to add in a future PR.

no index (regular numeric columns):

Benchmark                        (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score   Error  Units
SqlNestedDataBenchmark.querySql       14           5000000        false  avgt    5   97.164 ± 3.767  ms/op
SqlNestedDataBenchmark.querySql       14           5000000        force  avgt    5   53.835 ± 0.783  ms/op
SqlNestedDataBenchmark.querySql       26           5000000        false  avgt    5   75.090 ± 2.488  ms/op
SqlNestedDataBenchmark.querySql       26           5000000        force  avgt    5   48.634 ± 1.097  ms/op
SqlNestedDataBenchmark.querySql       28           5000000        false  avgt    5   86.636 ± 3.613  ms/op
SqlNestedDataBenchmark.querySql       28           5000000        force  avgt    5   52.267 ± 0.978  ms/op

predicate index (equivalent nested numeric columns with identical data):

Benchmark                        (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score   Error  Units
SqlNestedDataBenchmark.querySql       15           5000000        false  avgt    5  552.875 ± 6.711  ms/op
SqlNestedDataBenchmark.querySql       15           5000000        force  avgt    5  543.700 ± 6.557  ms/op
SqlNestedDataBenchmark.querySql       27           5000000        false  avgt    5   83.298 ± 1.470  ms/op
SqlNestedDataBenchmark.querySql       27           5000000        force  avgt    5   83.296 ± 1.646  ms/op
SqlNestedDataBenchmark.querySql       29           5000000        false  avgt    5  129.794 ± 2.851  ms/op
SqlNestedDataBenchmark.querySql       29           5000000        force  avgt    5  108.845 ± 2.076  ms/op

range index (equivalent nested numeric columns with identical data):

Benchmark                        (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       15           5000000        false  avgt    5  386.536 ± 14.620  ms/op
SqlNestedDataBenchmark.querySql       15           5000000        force  avgt    5  219.167 ±  4.013  ms/op
SqlNestedDataBenchmark.querySql       27           5000000        false  avgt    5   13.366 ±  0.596  ms/op
SqlNestedDataBenchmark.querySql       27           5000000        force  avgt    5   13.465 ±  0.623  ms/op
SqlNestedDataBenchmark.querySql       29           5000000        false  avgt    5   42.618 ±  0.971  ms/op
SqlNestedDataBenchmark.querySql       29           5000000        force  avgt    5   41.069 ±  0.896  ms/op

Revisiting the initial benchmarks I ran for nested columns, #12753 (comment), query '15' was significantly slower than '14' which had no indexes, and this is still true though to a lesser degree than before. Digging a bit deeper, the reason for this is that 2.6 million values match which means merging that many bitmaps. Query 27 has ~1500 matches, while query 29 has 270k matching values.

This indicates that there is some threshold of "too many bitmaps" at which we should skip using indexes and fall back to a full scan and value matchers. I plan to introduce a mechanism for this in a future PR.

I also added some direct testing for NestedFieldLiteralColumnIndexSupplier, which was previously only indirectly tested via queries. This was a bit tedious, but also fixed a couple of minor bugs.


This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

changes:
* NumericRangeIndex interface, like LexicographicalRangeIndex but for numbers
* BoundFilter now uses NumericRangeIndex if comparator is numeric and there is no extractionFn
* NestedFieldLiteralColumnIndexSupplier.java now supports supplying NumericRangeIndex for single typed numeric nested literal columns
boundDimFilter.getUpper(),
boundDimFilter.isUpperStrict()
);
// preserve sad backwards compatible behavior where bound filter matches 'null' if the lower bound is not set
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be refactor outside of the if as it is common to both the String and the Numeric

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they just look really similar right now, but the range indexes don't share a common interface currently so isn't super easy

return Filters.makeNullIndex(doesMatchNull(), selector);
}
final LexicographicalRangeIndex rangeIndex = indexSupplier.as(LexicographicalRangeIndex.class);
if (rangeIndex == null) {
Copy link
Copy Markdown
Contributor

@maytasm maytasm Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this a bug? (returning null)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was a bug with single type numeric nested literal columns and potentially future columns that only have some of these index types, but not STRING typed columns which have both lexicographical range indexes and predicate indexes.

@clintropolis
Copy link
Copy Markdown
Member Author

I've ended up reworking how the ranges are computed to be much more sane and efficient which has sped up the process quite a lot. I've updated the PR description with the new results

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants