add new typed in filter#16039
Conversation
changes: * adds TypedInFilter which preserves matching sets in the native match value type * SQL planner uses new TypedInFilter when druid.generic.useDefaultValueForNull=false (the default)
| return null; | ||
| } else if (arrayElements.length == 1) { | ||
| if (plannerContext.isUseBoundsAndSelectors()) { | ||
| if (plannerContext.isUseBoundsAndSelectors() || (!simpleExtractionExpr.isDirectColumnAccess() && virtualColumnRegistry == null)) { |
There was a problem hiding this comment.
when is virtualColumnRegistry null?
& if it can be null, then I think CodeQL makes a good point: could it be null in the else branch as well, where there is no null guard?
There was a problem hiding this comment.
it looks like it can be null in some join cases, https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidJoinQueryRel.java#L497 calls it with null. will add a guard
There was a problem hiding this comment.
oh wait, the else branch here only uses virtual column registry if its not a direct column access, so the registry cannot be null or else it would have made a selector filter with an extractionFn
There was a problem hiding this comment.
👍 in that case a comment would be useful? it seemed to confuse both us and CodeQL
There was a problem hiding this comment.
ok, reworked this a bit, if the virtual column registry is null it seems equivalent to just use an expression filter rather than forcing a selector filter + extractionFn, so I did that for now, since it seems better to do that and get away from reliance on using extractionFn defined on filters
| @Nullable | ||
| BitmapColumnIndex forSortedValues(@Nonnull List<?> sortedValues, TypeSignature<ValueType> matchValueType); | ||
|
|
||
| static <T> BitmapColumnIndex getIndexFromSortedIteratorSortedMerged( |
There was a problem hiding this comment.
Javadoc for this and getIndexFromSortedIterator? For some reason, the names don't make it obvious to me what they do. (One is regular and one is SortedMerged?)
Reading the code, it looks like this one does a linear scan through the dictionary, and the other one does a series of binary searches. Maybe call them getIndexFromSortedIteratorWithScan and getIndexFromSortedIteratorWithBinarySearch.
Oh, now I see why it's called SortedMerged. The zipping is kind of like the merge step of a merge-sort. Still, I feel the "scan" and "binary search" names would make more sense. And Javadoc would definitely help.
There was a problem hiding this comment.
there are 3 total methods, the sorted merge (getIndexFromSortedIteratorSortedMerged), the binary search with short-circuit when the values are sorted the same way as the dictionary and we get pas the end of the dictionary (getIndexFromSortedIterator), and the binary search when the values are not necessarily sorted the same way at the dictionary and so we must iterate them all (getIndexFromIterator).
Will add javadocs and try to give better names
There was a problem hiding this comment.
renamed stuff and added javadocs to explain the relation between these three functions and when to use them
| }; | ||
| } | ||
|
|
||
| static <T> BitmapColumnIndex getIndexFromIterator( |
There was a problem hiding this comment.
Javadoc would be useful for this one too.
| }; | ||
|
|
||
| // values are doubles and ordered in double order | ||
| if (matchValueType.is(ValueType.DOUBLE)) { |
There was a problem hiding this comment.
Is it possible to minimize the copied code between here and the LONG version, such as by using a shared helper somewhere?
There was a problem hiding this comment.
yea probably, also for the other index implementations.. i can look into it since has been on my todo list for a while
There was a problem hiding this comment.
i guess the downside to doing this is that i was hoping at some point to specialize the value dictionaries for primitive values (e.g. #12846) and combining them would make that not very easy since would need a generic parameter instead of java primitive.
There was a problem hiding this comment.
i have not done this change yet, still thinking about it
There was a problem hiding this comment.
I am okay with what you decide here. I can see both ways making sense.
| import java.util.Set; | ||
| import java.util.SortedSet; | ||
|
|
||
| public class TypedInFilter extends AbstractOptimizableDimFilter implements Filter |
There was a problem hiding this comment.
Some Javadoc for this and InDimFilter that point to each other would be good. People new to this area of the code base will need to be informed that both exist.
There was a problem hiding this comment.
added, also added recommendations to InDimFilter, SelectorDimFilter, and BoundDimFilter to use TypedInFilter, EqualityFilter, NullFilter, and RangeFilter instead
| } | ||
| if (matchValueType.is(ValueType.STRING)) { | ||
| this.lazyMatchValueBytes = Suppliers.memoize(() -> { | ||
| final SortedSet<ByteBuffer> matchValueBytes = new ObjectAVLTreeSet<>(ByteBufferUtils.utf8Comparator()); |
There was a problem hiding this comment.
Couldn't this be a List rather than a tree set? The values are already in the right order thanks to lazyMatchValues.get().
There was a problem hiding this comment.
yea, i would need to change the interface of Utf8ValueSetIndexes to not require a sorted set, but can go ahead and do that
There was a problem hiding this comment.
changed Utf8ValueSetIndexes to use a List now instead of a sortedset, which required some changes in InDimFilter too but i think is ok since the valueset there also ensures its ordered when it is created
| Object coerced = coerceValue(array[i], matchValueType); | ||
| //noinspection ObjectEquality | ||
| if (coerced != null && array[i] == coerced) { | ||
| // assume list is all same type objects... |
There was a problem hiding this comment.
Is this going to cause issues when reading JSON that has values set rather than sortedValues? (Where they might not all be the same type)
Also, do we care? I suppose if we aren't documenting this, then there's no way we'd get JSON with values in it. The serializer always writes sortedValues. You know, though, if this is the thinking, then we should really also split the constructor into two: a @JsonCreator that takes sortedValues only (not values) and a non-Jackson-enabled constructor that accepts possibly-unsorted values. No sense in supporting values in the JSON if we aren't going to use it.
There was a problem hiding this comment.
hmm, yea I guess i could always coerce while sorting just to be safe, was trying to short-circuit if it looked like it wasnt going to be necessary and was just assuming people wouldn't troll the set with mixed type junk like strings and numbers over json (and a mix of like ints and longs wouldn't really be a problem because the column type comparators handle the stuff as numbers rather than exact types to guard against json serde shenanigans)
i guess i was still on the fence of whether or not we document. we did document the other new native typed filters, so was sort of assuming we would this one too, especially since there is a performance improvement
There was a problem hiding this comment.
added coercion while sorting just in case
| this.unsortedValues = null; | ||
| this.lazyMatchValues = () -> sortedValues; | ||
| } else { | ||
| if (checkSorted(values, matchValueType)) { |
There was a problem hiding this comment.
Do we need to check for duplicates too, or are duplicates ok in lazyMatchValues? Similar question for sortValues: does it need to dedupe or is just sorting ok?
There was a problem hiding this comment.
oops, checkSorted actually does mark it as not sorted if there are dupe values, but sortValues method used to dedupe and now it isn't, i should probably fix that, was lost when i switched to using ObjectArrays.quickSort.
that said, dupes wouldn't really break anything, just inefficient
There was a problem hiding this comment.
i went back to using a sorted set to order and dedupe since I could think of a very efficient way to do this otherwise other than creating my own modified version of quickSort that also dedupes
There was a problem hiding this comment.
Hmm, dedupe after quicksort is O(n); probably wouldn't notice it much next to the quicksort. It might even be faster to do quicksort + dedupe pass vs. doing tree-sort. Although, this probably doesn't matter that much, since it's only happening once per query. The sorted set approach is IMO OK for this patch, esp. since that's the same thing the in filter is doing.
There was a problem hiding this comment.
yea, i toyed with a bit but didn't spend anytime measuring anything yet. I think the main thing that made me switch was the addition of the value coercion, with it too we iterate list once to make them all be the matchValueType, then the sort, then again to dedupe, so it seemed like maybe i should just use the set and coerce on insert and then turn that into a list.
Now i'm a bit curious though, so will do some measurements.
There was a problem hiding this comment.
ok, so got around to measuring it and went back to quicksort, even with coercion and dedupe is still better on bigger sets
set:
Benchmark (filterSize) Mode Cnt Score Error Units
InFilterBenchmark.sortFilterValues 1 avgt 3 0.021 ± 0.006 us/op
InFilterBenchmark.sortFilterValues 10 avgt 3 0.323 ± 0.035 us/op
InFilterBenchmark.sortFilterValues 100 avgt 3 4.521 ± 0.483 us/op
InFilterBenchmark.sortFilterValues 1000 avgt 3 104.984 ± 11.389 us/op
InFilterBenchmark.sortFilterValues 10000 avgt 3 1906.647 ± 922.912 us/op
quicksort:
Benchmark (filterSize) Mode Cnt Score Error Units
InFilterBenchmark.sortFilterValues 1 avgt 3 0.021 ± 0.005 us/op
InFilterBenchmark.sortFilterValues 10 avgt 3 0.574 ± 0.083 us/op
InFilterBenchmark.sortFilterValues 100 avgt 3 4.678 ± 1.100 us/op
InFilterBenchmark.sortFilterValues 1000 avgt 3 84.306 ± 10.615 us/op
InFilterBenchmark.sortFilterValues 10000 avgt 3 1555.324 ± 228.895 us/op
(I didn't add this benchmark to the PR)
| } | ||
|
|
||
| private static DruidObjectPredicate<String> createStringPredicate( | ||
| final List sortedValues, |
There was a problem hiding this comment.
I usually prefer List<?> if it works ok, because it at least keeps the unknownness to the single type param rather than encouraging the dropping of all type params.
| import java.util.List; | ||
|
|
||
| @RunWith(Enclosed.class) | ||
| public class TypedInFilterTests |
There was a problem hiding this comment.
Is there a way to write a test class that subclasses InFilterTest so it automatically gets any new tests we add there? Maybe make inFilter a protected abstract method rather than a static. The copied tests here make it tough to properly maintain the test suite.
There was a problem hiding this comment.
hmm, the new filter doesn't actually support default value mode though, so need to think of a way to handle that (as well as extractionFn tests). the other thing i could think of that might be a problem is that converting doubles to strings is sometimes ugly, so might still need to override some of those tests anyway depending on the mode
in the worst case i'll add javadoc comment to that test, and other classic filter tests that anything added there should be considered to also be added to the strongly typed replacement/equivalents tests
There was a problem hiding this comment.
ok, consolidated these tests. its kind of ugly, but i think ok
abhishekagarwal87
left a comment
There was a problem hiding this comment.
My review is in-complete but posting it anyway since I might not get to the rest of the PR.
| <groupId>com.amazonaws</groupId> | ||
| <artifactId>aws-java-sdk-bundle</artifactId> | ||
| <version>${aws.sdk.version}</version> | ||
| <scope>provided</scope> |
There was a problem hiding this comment.
comment is outdated, iirc this was an accident from experimenting with something else in the wrong branch
| static final byte RANGE_CACHE_ID = 0x14; | ||
| static final byte IS_FILTER_BOOLEAN_FILTER_CACHE_ID = 0x15; | ||
| static final byte ARRAY_CONTAINS_CACHE_ID = 0x16; | ||
| static final byte NEW_IN_CACHE_ID = 0x17; |
There was a problem hiding this comment.
| static final byte NEW_IN_CACHE_ID = 0x17; | |
| static final byte TYPED_IN_CACHE_ID = 0x17; |
| * Creates a new filter. | ||
| * | ||
| * @param column column to search | ||
| * @param values set of values to match. This collection may be reused to avoid copying a big collection. |
There was a problem hiding this comment.
Missing param for sortedValues.
| throw InvalidInput.exception("Invalid IN filter on column [%s], matchValueType cannot be null", column); | ||
| } | ||
| // one of sorted or not sorted | ||
| if (values == null && sortedValues == null) { |
There was a problem hiding this comment.
should you also check that both of them are not null either
There was a problem hiding this comment.
yea, might as well with the way things are now
| } | ||
| RangeSet<String> retSet = TreeRangeSet.create(); | ||
| for (Object value : lazyMatchValues.get()) { | ||
| String valueEquivalent = NullHandling.nullToEmptyIfNeeded(Evals.asString(value)); |
There was a problem hiding this comment.
since this filter only works in sql compatible mode, why wrap inside NullHandling.nullToEmptyIfNeeded?
There was a problem hiding this comment.
stale code from when i was originally aspiring to support both modes, will remove
| column | ||
| ); | ||
| } | ||
| if (sortedValues != null) { |
There was a problem hiding this comment.
what do you think of validating that sortedValues are indeed sorted?
There was a problem hiding this comment.
im against it, it defeats the purpose of keeping them split, which is to not do extra work unless we need to. if people want to manually craft json that is broken then that's on them. if we document this filter i'll be sure to make it super clear that if this is set that it must be sorted exactly the same as the matchValueType, or maybe just not document that this property exists at all
There was a problem hiding this comment.
added javadocs to hopefully better explain inner workings
There was a problem hiding this comment.
probably best to document it, since it will show up in explain plan from SQL, and people will wonder what it means.
| } | ||
| return false; | ||
| } | ||
| return o1.size() == o2.size(); |
There was a problem hiding this comment.
could be done before the loop to make the comparison faster.
| import java.util.List; | ||
| import java.util.NoSuchElementException; | ||
|
|
||
| public interface ValueSetIndexes |
There was a problem hiding this comment.
the methods that callers use have javadocs which is consistent with all of these other interfaces.. but will maybe fill them in with some generic bit about getting BitmapColumnIndex for sets of values or .. something along with other interfaces
| String valueEquivalent = Evals.asString(value); | ||
| if (valueEquivalent == null) { | ||
| // Case when SQL compatible null handling is enabled | ||
| // Range.singleton(null) is invalid, so use the fact that |
There was a problem hiding this comment.
Most of this comment is still useful in explaining why we're doing lessThan("").
|
|
||
| /** | ||
| * Supplier for list of values sorted by {@link #matchValueType}. This is lazily computed if | ||
| * {@link #unsortedValues} is not null and previously sorted. |
There was a problem hiding this comment.
would be useful to include a comment about whether this can contain duplicates or not, and if duplicates are not meant to be here, then also include a test that validates that deduplication happens on values (even if they are provided in sorted order).
| }; | ||
|
|
||
| // values are doubles and ordered in double order | ||
| if (matchValueType.is(ValueType.DOUBLE)) { |
There was a problem hiding this comment.
I am okay with what you decide here. I can see both ways making sense.
gianm
left a comment
There was a problem hiding this comment.
Approving since my remaining comments are about comments and optional refactors.
Description
Adds a new
TypedInFiltersimilar to the work done for equality and range filters in #14542, as a replacement forInDimFilterwhich deals in native match value types instead of only supporting string sets. This results in a pretty decent performance increase particularly for matching numeric columns, both when using indexes and matchers, since we don't need to convert the values to do the comparison when the filter match value type matches the underlying column type.The SQL planner uses new
TypedInFilterwheneversqlUseBoundAndSelectorsis set to false, which is itself by default tied todruid.generic.useDefaultValueForNull=false(also the default). This filter does not support default value mode at all, so ifdruid.generic.useDefaultValueForNull=truethen this filter will not be used even ifsqlUseBoundAndSelectorsis set to false.The JSON creator for
TypedInFilter:accept either
valueswhich may or may not be sorted (the constructor checks), andsortedValueswhich is trusted to be sorted. Nearly all callers should just setvaluesunless presorting the values, and if the values are already sorted then the O(n) scan of them to check for sortedness should be relatively cheap and will automatically move them tosortedValues. The sorting if needed is done lazily, only when doing things like json serialization, computing cache key, or checking if filters are equivalent. The idea is that the values can be sorted by the broker and then be guaranteed to be sorted when serialized to send to the historicals so that they don't have to waste any effort sorting the values.Unfortunately the SQL planner does serialize these filters when explaining queries, which is the idea behind checking if the values are pre-sorted (and also the same type as the matchValueType parameter) in the constructor, as much of the SQL planning will actually have the values already sorted, the only cases which are not are things like join and lookup rewrites. Hopefully this will be a bit cheaper than building the sorted set that
InDimFilteruses.Speaking of sorted sets, this new filter uses plain java lists instead of a sorted set, and then
Collections.binarySearchto locate items within them. Since these value sets are immutable by the time we need to find things in them, the overhead of sorted sets doesn't really bring much to the table, so this doing this binary search instead is slightly faster as well.IN filters on numeric columns:
IN filters on string columns:
The string results are pretty close since most of the internals are identical, though still shows a slight improvement from using the sorted list instead of sorted set.
Release note
TBD
This PR has: