add new typed in filter by clintropolis · Pull Request #16039 · apache/druid

clintropolis · 2024-03-05T05:34:43Z

Description

Adds a new TypedInFilter, similar to the work done for equality and range filters in #14542, as a replacement for InDimFilter. The new filter deals in native match value types instead of only supporting string sets. This results in a pretty decent performance increase, particularly for matching numeric columns, both when using indexes and when using matchers, since we don't need to convert the values to do the comparison when the filter match value type matches the underlying column type.

The SQL planner uses the new TypedInFilter whenever sqlUseBoundAndSelectors is set to false, which is itself tied by default to druid.generic.useDefaultValueForNull=false (also the default). This filter does not support default value mode at all, so if druid.generic.useDefaultValueForNull=true, this filter will not be used even if sqlUseBoundAndSelectors is set to false.

The JSON creator for TypedInFilter:

  @JsonCreator
  public TypedInFilter(
      @JsonProperty("column") String column,
      @JsonProperty("matchValueType") ColumnType matchValueType,
      @JsonProperty("values") @Nullable List<?> values,
      @JsonProperty("sortedValues") @Nullable List<?> sortedValues,
      @JsonProperty("filterTuning") @Nullable FilterTuning filterTuning
  )

accepts either values, which may or may not be sorted (the constructor checks), or sortedValues, which is trusted to be sorted. Nearly all callers should just set values unless they are presorting the values; if the values are already sorted, the O(n) scan to check for sortedness should be relatively cheap, and the constructor automatically moves them to sortedValues. The sorting, if needed, is done lazily, only when doing things like JSON serialization, computing the cache key, or checking if filters are equivalent. The idea is that the values can be sorted by the broker and are then guaranteed to be sorted when serialized and sent to the historicals, so that the historicals don't have to waste any effort sorting them.
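A minimal sketch of the kind of O(n) sortedness scan described above. The class and method names are hypothetical and this is not the PR's actual checkSorted code; it only checks ordering (not that every element has the match value type), and it treats duplicates as "not sorted", as mentioned later in the review discussion.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not taken from the Druid code base.
final class SortedCheckSketch
{
  // Returns true only if the values are strictly ascending under the comparator,
  // so a list containing duplicates is reported as "not sorted".
  static <T> boolean isStrictlySorted(List<T> values, Comparator<? super T> comparator)
  {
    T previous = null;
    boolean first = true;
    for (T value : values) {
      if (!first && comparator.compare(previous, value) >= 0) {
        return false;
      }
      previous = value;
      first = false;
    }
    return true;
  }

  public static void main(String[] args)
  {
    Comparator<Long> longs = Comparator.<Long>naturalOrder();
    System.out.println(isStrictlySorted(Arrays.asList(1L, 19L, 21L), longs)); // true
    System.out.println(isStrictlySorted(Arrays.asList(21L, 1L, 19L), longs)); // false
    System.out.println(isStrictlySorted(Arrays.asList(1L, 1L, 19L), longs));  // false
  }
}
```

A caller following the description could route the list to the trusted sorted path when the scan returns true, and otherwise defer sorting until it is actually needed.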

Unfortunately, the SQL planner does serialize these filters when explaining queries. That is the idea behind checking in the constructor whether the values are pre-sorted (and also the same type as the matchValueType parameter): much of the SQL planning will actually have the values already sorted, and the only cases which do not are things like join and lookup rewrites. Hopefully this will be a bit cheaper than building the sorted set that InDimFilter uses.

Speaking of sorted sets, this new filter uses plain Java lists instead of a sorted set, and then Collections.binarySearch to locate items within them. Since these value sets are immutable by the time we need to find things in them, the overhead of sorted sets doesn't really bring much to the table, so doing this binary search instead is slightly faster as well.
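As a rough illustration of the lookup described above, the following sketch tests membership in an already sorted, immutable list with Collections.binarySearch. The class and method names are hypothetical; the filter's real matchers are more involved.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only.
final class SortedListLookupSketch
{
  // The list must already be sorted by the same comparator, otherwise the result is undefined.
  static <T> boolean contains(List<? extends T> sortedValues, T candidate, Comparator<? super T> comparator)
  {
    // binarySearch returns a non-negative index when the key is present
    return Collections.binarySearch(sortedValues, candidate, comparator) >= 0;
  }

  public static void main(String[] args)
  {
    List<Long> values = Arrays.asList(1L, 19L, 21L, 23L, 25L, 26L, 46L);
    Comparator<Long> longs = Comparator.<Long>naturalOrder();
    System.out.println(contains(values, 23L, longs)); // true
    System.out.println(contains(values, 24L, longs)); // false
  }
}
```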

IN filters on numeric columns:

"SELECT long2 FROM foo WHERE long2 IN (1, 19, 21, 23, 25, 26, 46)",
before:
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       22           5000000  explicit              none        false  avgt    5  316.998 ±  2.726  ms/op
SqlNestedDataBenchmark.querySql       22           5000000  explicit              none        force  avgt    5  318.369 ±  3.914  ms/op
SqlNestedDataBenchmark.querySql       22           5000000      auto              none        false  avgt    5  171.867 ±  2.913  ms/op
SqlNestedDataBenchmark.querySql       22           5000000      auto              none        force  avgt    5  171.676 ±  2.216  ms/op
after:
SqlNestedDataBenchmark.querySql       22           5000000  explicit              none        false  avgt    5  265.012 ±  8.229  ms/op
SqlNestedDataBenchmark.querySql       22           5000000  explicit              none        force  avgt    5  273.842 ±  7.391  ms/op
SqlNestedDataBenchmark.querySql       22           5000000      auto              none        false  avgt    5  171.550 ±  1.633  ms/op
SqlNestedDataBenchmark.querySql       22           5000000      auto              none        force  avgt    5  173.540 ±  2.413  ms/op

"SELECT long2 FROM foo WHERE long2 IN (1, 19, 21, 23, 25, 26, 46) GROUP BY 1",
before:
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       24           5000000  explicit              none        false  avgt    5  365.012 ±  6.052  ms/op
SqlNestedDataBenchmark.querySql       24           5000000  explicit              none        force  avgt    5  250.669 ±  3.304  ms/op
SqlNestedDataBenchmark.querySql       24           5000000      auto              none        false  avgt    5  218.247 ±  1.382  ms/op
SqlNestedDataBenchmark.querySql       24           5000000      auto              none        force  avgt    5  162.742 ±  4.829  ms/op
after:
SqlNestedDataBenchmark.querySql       24           5000000  explicit              none        false  avgt    5  307.309 ± 10.928  ms/op
SqlNestedDataBenchmark.querySql       24           5000000  explicit              none        force  avgt    5  237.739 ±  8.574  ms/op
SqlNestedDataBenchmark.querySql       24           5000000      auto              none        false  avgt    5  213.082 ±  3.638  ms/op
SqlNestedDataBenchmark.querySql       24           5000000      auto              none        force  avgt    5  155.719 ±  3.434  ms/op

"SELECT long2 FROM foo WHERE long2 IN (1, 19, 21, 23, 25, 26, 46, 50, 51, 55, 60, 61, 66, 68, 69, 70, 77, 88, 90, 92, 93, 94, 95, 100, 101, 102, 104, 109, 111, 113, 114, 115, 120, 121, 122, 134, 135, 136, 140, 142, 150, 155, 170, 172, 173, 174, 180, 181, 190, 199, 200, 201, 202, 203, 204)",
before:
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       48           5000000  explicit              none        false  avgt    5  858.528 ±  6.713  ms/op
SqlNestedDataBenchmark.querySql       48           5000000  explicit              none        force  avgt    5  860.413 ± 10.305  ms/op
SqlNestedDataBenchmark.querySql       48           5000000      auto              none        false  avgt    5  146.786 ±  5.952  ms/op
SqlNestedDataBenchmark.querySql       48           5000000      auto              none        force  avgt    5  154.059 ±  8.155  ms/op
after:
SqlNestedDataBenchmark.querySql       48           5000000  explicit              none        false  avgt    5  260.688 ±  2.862  ms/op
SqlNestedDataBenchmark.querySql       48           5000000  explicit              none        force  avgt    5  256.196 ±  4.310  ms/op
SqlNestedDataBenchmark.querySql       48           5000000      auto              none        false  avgt    5  177.440 ±  1.757  ms/op
SqlNestedDataBenchmark.querySql       48           5000000      auto              none        force  avgt    5  147.445 ±  3.371  ms/op

"SELECT long2 FROM foo WHERE long2 IN (1, 19, 21, 23, 25, 26, 46, 50, 51, 55, 60, 61, 66, 68, 69, 70, 77, 88, 90, 92, 93, 94, 95, 100, 101, 102, 104, 109, 111, 113, 114, 115, 120, 121, 122, 134, 135, 136, 140, 142, 150, 155, 170, 172, 173, 174, 180, 181, 190, 199, 200, 201, 202, 203, 204) GROUP BY 1",
before:
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       50           5000000  explicit              none        false  avgt    5  947.775 ± 12.952  ms/op
SqlNestedDataBenchmark.querySql       50           5000000  explicit              none        force  avgt    5  377.426 ± 10.417  ms/op
SqlNestedDataBenchmark.querySql       50           5000000      auto              none        false  avgt    5  220.622 ±  6.089  ms/op
SqlNestedDataBenchmark.querySql       50           5000000      auto              none        force  avgt    5  163.623 ±  1.712  ms/op
after:
SqlNestedDataBenchmark.querySql       50           5000000  explicit              none        false  avgt    5  325.515 ±  7.563  ms/op
SqlNestedDataBenchmark.querySql       50           5000000  explicit              none        force  avgt    5  255.146 ±  4.047  ms/op
SqlNestedDataBenchmark.querySql       50           5000000      auto              none        false  avgt    5  216.967 ±  7.754  ms/op
SqlNestedDataBenchmark.querySql       50           5000000      auto              none        force  avgt    5  157.717 ±  4.723  ms/op

"SELECT long2 FROM foo WHERE double3 IN (1.0, 19.0, 21.0, 23.0, 25.0, 26.0, 46.0, 50.0, 51.0, 55.0, 60.0, 61.0, 66.0, 68.0, 69.0, 70.0, 77.0, 88.0, 90.0, 92.0, 93.0, 94.0, 95.0, 100.0, 101.0, 102.0, 104.0, 109.0, 111.0, 113.0, 114.0, 115.0, 120.0, 121.0, 122.0, 134.0, 135.0, 136.0, 140.0, 142.0, 150.0, 155.0, 170.0, 172.0, 173.0, 174.0, 180.0, 181.0, 190.0, 199.0, 200.0, 201.0, 202.0, 203.0, 204.0)",
before:
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       52           5000000  explicit              none        false  avgt    5  773.810 ±  7.200  ms/op
SqlNestedDataBenchmark.querySql       52           5000000  explicit              none        force  avgt    5  772.494 ±  3.224  ms/op
SqlNestedDataBenchmark.querySql       52           5000000      auto              none        false  avgt    5    2.169 ±  0.218  ms/op
SqlNestedDataBenchmark.querySql       52           5000000      auto              none        force  avgt    5    2.205 ±  0.217  ms/op
after:
SqlNestedDataBenchmark.querySql       52           5000000  explicit              none        false  avgt    5  108.259 ±  2.217  ms/op
SqlNestedDataBenchmark.querySql       52           5000000  explicit              none        force  avgt    5  108.179 ±  1.548  ms/op
SqlNestedDataBenchmark.querySql       52           5000000      auto              none        false  avgt    5    2.070 ±  0.144  ms/op
SqlNestedDataBenchmark.querySql       52           5000000      auto              none        force  avgt    5    2.019 ±  0.133  ms/op

"SELECT long2 FROM foo WHERE double3 IN (1.0, 19.0, 21.0, 23.0, 25.0, 26.0, 46.0, 50.0, 51.0, 55.0, 60.0, 61.0, 66.0, 68.0, 69.0, 70.0, 77.0, 88.0, 90.0, 92.0, 93.0, 94.0, 95.0, 100.0, 101.0, 102.0, 104.0, 109.0, 111.0, 113.0, 114.0, 115.0, 120.0, 121.0, 122.0, 134.0, 135.0, 136.0, 140.0, 142.0, 150.0, 155.0, 170.0, 172.0, 173.0, 174.0, 180.0, 181.0, 190.0, 199.0, 200.0, 201.0, 202.0, 203.0, 204.0) GROUP BY 1",
before:
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       54           5000000  explicit              none        false  avgt    5  904.097 ±  9.795  ms/op
SqlNestedDataBenchmark.querySql       54           5000000  explicit              none        force  avgt    5  242.745 ±  5.692  ms/op
SqlNestedDataBenchmark.querySql       54           5000000      auto              none        false  avgt    5  128.160 ±  2.043  ms/op
SqlNestedDataBenchmark.querySql       54           5000000      auto              none        force  avgt    5  129.161 ±  5.200  ms/op
after:
SqlNestedDataBenchmark.querySql       54           5000000  explicit              none        false  avgt    5  230.324 ± 12.410  ms/op
SqlNestedDataBenchmark.querySql       54           5000000  explicit              none        force  avgt    5  201.999 ±  4.346  ms/op
SqlNestedDataBenchmark.querySql       54           5000000      auto              none        false  avgt    5  125.256 ± 18.262  ms/op
SqlNestedDataBenchmark.querySql       54           5000000      auto              none        force  avgt    5  123.719 ±  8.906  ms/op

IN filters on string columns:

SELECT * FROM foo WHERE dimSequential IN ('1', '2', '3', '4', '5', '10', '11', '20', '21', '23', '40', '50', '64', '70', '100')

before:
Benchmark              (query)  (rowsPerSegment)  (schema)  (storageType)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlBenchmark.querySql       24           5000000  explicit           mmap              none        false  avgt    5  272.734 ± 18.392  ms/op
SqlBenchmark.querySql       24           5000000  explicit           mmap              none        force  avgt    5  272.081 ± 31.917  ms/op
SqlBenchmark.querySql       24           5000000      auto           mmap              none        false  avgt    5  232.986 ± 14.621  ms/op
SqlBenchmark.querySql       24           5000000      auto           mmap              none        force  avgt    5  229.473 ± 11.148  ms/op
after:
SqlBenchmark.querySql       24           5000000  explicit           mmap              none        false  avgt    5  260.132 ±  5.194  ms/op
SqlBenchmark.querySql       24           5000000  explicit           mmap              none        force  avgt    5  263.946 ±  4.753  ms/op
SqlBenchmark.querySql       24           5000000      auto           mmap              none        false  avgt    5  228.280 ± 12.907  ms/op
SqlBenchmark.querySql       24           5000000      auto           mmap              none        force  avgt    5  228.460 ± 10.062  ms/op


SELECT dimSequential, dimZipf, SUM(sumLongSequential) FROM foo WHERE dimSequential IN ('1', '2', '3', '4', '5', '10', '11', '20', '21', '23', '40', '50', '64', '70', '100') GROUP BY 1, 2

before:
Benchmark              (query)  (rowsPerSegment)  (schema)  (storageType)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlBenchmark.querySql       26           5000000  explicit           mmap              none        false  avgt    5   26.105 ±  2.186  ms/op
SqlBenchmark.querySql       26           5000000  explicit           mmap              none        force  avgt    5   22.348 ±  1.196  ms/op
SqlBenchmark.querySql       26           5000000      auto           mmap              none        false  avgt    5   25.959 ±  0.923  ms/op
SqlBenchmark.querySql       26           5000000      auto           mmap              none        force  avgt    5   22.445 ±  1.647  ms/op
after:
SqlBenchmark.querySql       26           5000000  explicit           mmap              none        false  avgt    5   25.170 ±  0.654  ms/op
SqlBenchmark.querySql       26           5000000  explicit           mmap              none        force  avgt    5   21.514 ±  1.403  ms/op
SqlBenchmark.querySql       26           5000000      auto           mmap              none        false  avgt    5   25.258 ±  0.501  ms/op
SqlBenchmark.querySql       26           5000000      auto           mmap              none        force  avgt    5   21.564 ±  0.824  ms/op

The string results are pretty close since most of the internals are identical, though they still show a slight improvement from using the sorted list instead of the sorted set.

Release note

TBD

This PR has:

changes:
* adds TypedInFilter which preserves matching sets in the native match value type
* SQL planner uses new TypedInFilter when druid.generic.useDefaultValueForNull=false (the default)

gianm · 2024-03-19T18:29:08Z

        return null;
      } else if (arrayElements.length == 1) {
-        if (plannerContext.isUseBoundsAndSelectors()) {
+        if (plannerContext.isUseBoundsAndSelectors() || (!simpleExtractionExpr.isDirectColumnAccess() && virtualColumnRegistry == null)) {


when is virtualColumnRegistry null?

& if it can be null, then I think CodeQL makes a good point: could it be null in the else branch as well, where there is no null guard?

it looks like it can be null in some join cases, https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidJoinQueryRel.java#L497 calls it with null. will add a guard

oh wait, the else branch here only uses the virtual column registry if it's not a direct column access, so the registry cannot be null or else it would have made a selector filter with an extractionFn

👍 in that case a comment would be useful? it seemed to confuse both us and CodeQL

ok, reworked this a bit, if the virtual column registry is null it seems equivalent to just use an expression filter rather than forcing a selector filter + extractionFn, so I did that for now, since it seems better to do that and get away from reliance on using extractionFn defined on filters

gianm · 2024-03-20T02:48:22Z

+  @Nullable
+  BitmapColumnIndex forSortedValues(@Nonnull List<?> sortedValues, TypeSignature<ValueType> matchValueType);
+
+  static <T> BitmapColumnIndex getIndexFromSortedIteratorSortedMerged(


Javadoc for this and getIndexFromSortedIterator? For some reason, the names don't make it obvious to me what they do. (One is regular and one is SortedMerged?)

Reading the code, it looks like this one does a linear scan through the dictionary, and the other one does a series of binary searches. Maybe call them getIndexFromSortedIteratorWithScan and getIndexFromSortedIteratorWithBinarySearch.

Oh, now I see why it's called SortedMerged. The zipping is kind of like the merge step of a merge-sort. Still, I feel the "scan" and "binary search" names would make more sense. And Javadoc would definitely help.

there are 3 total methods: the sorted merge (getIndexFromSortedIteratorSortedMerged), the binary search with short-circuit when the values are sorted the same way as the dictionary and we get past the end of the dictionary (getIndexFromSortedIterator), and the binary search when the values are not necessarily sorted the same way as the dictionary and so we must iterate them all (getIndexFromIterator).

Will add javadocs and try to give better names

renamed stuff and added javadocs to explain the relation between these three functions and when to use them
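To make the difference between the strategies in this thread concrete, here is a simplified sketch over a sorted string dictionary. It returns dictionary ids in a plain list rather than building a BitmapColumnIndex, and the class and method names are hypothetical, so treat it as an illustration of the two lookup patterns rather than the PR's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; assumes the dictionary is sorted and has no duplicates.
final class DictionaryLookupSketch
{
  // "Sorted merge": walk the dictionary and the values in lockstep, like the merge step of merge sort.
  static List<Integer> idsBySortedMerge(List<String> dictionary, List<String> sortedValues)
  {
    List<Integer> ids = new ArrayList<>();
    int d = 0;
    int v = 0;
    while (d < dictionary.size() && v < sortedValues.size()) {
      int cmp = dictionary.get(d).compareTo(sortedValues.get(v));
      if (cmp == 0) {
        ids.add(d);
        d++;
        v++;
      } else if (cmp < 0) {
        d++;
      } else {
        v++;
      }
    }
    return ids;
  }

  // Binary search per value, short-circuiting once a value sorts past the end of the dictionary.
  static List<Integer> idsByBinarySearch(List<String> dictionary, List<String> sortedValues)
  {
    List<Integer> ids = new ArrayList<>();
    for (String value : sortedValues) {
      int pos = Collections.binarySearch(dictionary, value);
      if (pos >= 0) {
        ids.add(pos);
      } else if (-(pos + 1) >= dictionary.size()) {
        // insertion point is past the last entry, so no later (larger) value can match
        break;
      }
    }
    return ids;
  }

  public static void main(String[] args)
  {
    List<String> dictionary = Arrays.asList("a", "c", "e", "g");
    List<String> values = Arrays.asList("b", "c", "g", "z");
    System.out.println(idsBySortedMerge(dictionary, values));  // [1, 3]
    System.out.println(idsByBinarySearch(dictionary, values)); // [1, 3]
  }
}
```

The third case from the discussion, values that are not sorted the same way as the dictionary, would look like the binary search variant without the early break.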

gianm · 2024-03-20T02:52:21Z

+    };
+  }
+
+  static <T> BitmapColumnIndex getIndexFromIterator(


Javadoc would be useful for this one too.

gianm · 2024-03-20T03:33:05Z

+      };
+
+      // values are doubles and ordered in double order
+      if (matchValueType.is(ValueType.DOUBLE)) {


Is it possible to minimize the copied code between here and the LONG version, such as by using a shared helper somewhere?

yea probably, also for the other index implementations.. i can look into it since has been on my todo list for a while

i guess the downside to doing this is that i was hoping at some point to specialize the value dictionaries for primitive values (e.g. #12846) and combining them would make that not very easy since would need a generic parameter instead of java primitive.

i have not done this change yet, still thinking about it

I am okay with what you decide here. I can see both ways making sense.

gianm · 2024-03-20T03:34:20Z

+import java.util.Set;
+import java.util.SortedSet;
+
+public class TypedInFilter extends AbstractOptimizableDimFilter implements Filter


Some Javadoc for this and InDimFilter that point to each other would be good. People new to this area of the code base will need to be informed that both exist.

added, also added recommendations to InDimFilter, SelectorDimFilter, and BoundDimFilter to use TypedInFilter, EqualityFilter, NullFilter, and RangeFilter instead

gianm · 2024-03-20T04:14:49Z

+    }
+    if (matchValueType.is(ValueType.STRING)) {
+      this.lazyMatchValueBytes = Suppliers.memoize(() -> {
+        final SortedSet<ByteBuffer> matchValueBytes = new ObjectAVLTreeSet<>(ByteBufferUtils.utf8Comparator());


Couldn't this be a List rather than a tree set? The values are already in the right order thanks to lazyMatchValues.get().

yea, i would need to change the interface of Utf8ValueSetIndexes to not require a sorted set, but can go ahead and do that

changed Utf8ValueSetIndexes to use a List now instead of a sortedset, which required some changes in InDimFilter too but i think is ok since the valueset there also ensures its ordered when it is created

gianm · 2024-03-20T04:30:09Z

+      Object coerced = coerceValue(array[i], matchValueType);
+      //noinspection ObjectEquality
+      if (coerced != null && array[i] == coerced) {
+        // assume list is all same type objects...


Is this going to cause issues when reading JSON that has values set rather than sortedValues? (Where they might not all be the same type)

Also, do we care? I suppose if we aren't documenting this, then there's no way we'd get JSON with values in it. The serializer always writes sortedValues. You know, though, if this is the thinking, then we should really also split the constructor into two: a @JsonCreator that takes sortedValues only (not values) and a non-Jackson-enabled constructor that accepts possibly-unsorted values. No sense in supporting values in the JSON if we aren't going to use it.

hmm, yea I guess i could always coerce while sorting just to be safe, was trying to short-circuit if it looked like it wasnt going to be necessary and was just assuming people wouldn't troll the set with mixed type junk like strings and numbers over json (and a mix of like ints and longs wouldn't really be a problem because the column type comparators handle the stuff as numbers rather than exact types to guard against json serde shenanigans)

i guess i was still on the fence of whether or not we document. we did document the other new native typed filters, so was sort of assuming we would this one too, especially since there is a performance improvement

added coercion while sorting just in case

gianm · 2024-03-20T04:33:11Z

+      this.unsortedValues = null;
+      this.lazyMatchValues = () -> sortedValues;
+    } else {
+      if (checkSorted(values, matchValueType)) {


Do we need to check for duplicates too, or are duplicates ok in lazyMatchValues? Similar question for sortValues: does it need to dedupe or is just sorting ok?

oops, checkSorted actually does mark it as not sorted if there are dupe values, but sortValues method used to dedupe and now it isn't, i should probably fix that, was lost when i switched to using ObjectArrays.quickSort.

that said, dupes wouldn't really break anything, just inefficient

i went back to using a sorted set to order and dedupe since I couldn't think of a very efficient way to do this otherwise, other than creating my own modified version of quickSort that also dedupes

Hmm, dedupe after quicksort is O(n); probably wouldn't notice it much next to the quicksort. It might even be faster to do quicksort + dedupe pass vs. doing tree-sort. Although, this probably doesn't matter that much, since it's only happening once per query. The sorted set approach is IMO OK for this patch, esp. since that's the same thing the in filter is doing.

yea, i toyed with it a bit but didn't spend any time measuring anything yet. I think the main thing that made me switch was the addition of the value coercion: with it we iterate the list once to make them all be the matchValueType, then sort, then iterate again to dedupe, so it seemed like maybe i should just use the set and coerce on insert and then turn that into a list.

Now i'm a bit curious though, so will do some measurements.

ok, so got around to measuring it and went back to quicksort, even with coercion and dedupe is still better on bigger sets

set:
Benchmark                           (filterSize)  Mode  Cnt     Score     Error  Units
InFilterBenchmark.sortFilterValues             1  avgt    3     0.021 ±   0.006  us/op
InFilterBenchmark.sortFilterValues            10  avgt    3     0.323 ±   0.035  us/op
InFilterBenchmark.sortFilterValues           100  avgt    3     4.521 ±   0.483  us/op
InFilterBenchmark.sortFilterValues          1000  avgt    3   104.984 ±  11.389  us/op
InFilterBenchmark.sortFilterValues         10000  avgt    3  1906.647 ± 922.912  us/op
quicksort:
Benchmark                           (filterSize)  Mode  Cnt     Score     Error  Units
InFilterBenchmark.sortFilterValues             1  avgt    3     0.021 ±   0.005  us/op
InFilterBenchmark.sortFilterValues            10  avgt    3     0.574 ±   0.083  us/op
InFilterBenchmark.sortFilterValues           100  avgt    3     4.678 ±   1.100  us/op
InFilterBenchmark.sortFilterValues          1000  avgt    3    84.306 ±  10.615  us/op
InFilterBenchmark.sortFilterValues         10000  avgt    3  1555.324 ± 228.895  us/op

(I didn't add this benchmark to the PR)
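The coerce, sort, then dedupe sequence discussed in this thread can be sketched roughly as follows for a LONG match value type. This is an assumption-laden illustration: it uses java.util.Arrays.sort on a primitive array instead of the fastutil ObjectArrays.quickSort mentioned above, the class and method names are hypothetical, and it does not handle null values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of coerce -> sort -> single-pass dedupe; nulls are not handled.
final class SortAndDedupeSketch
{
  static List<Long> coerceSortDedupe(List<? extends Number> values)
  {
    // pass 1: coerce every element to the match value type (long here)
    long[] coerced = new long[values.size()];
    for (int i = 0; i < coerced.length; i++) {
      coerced[i] = values.get(i).longValue();
    }
    // sort in place
    Arrays.sort(coerced);
    // pass 2: O(n) dedupe, keeping an element only if it differs from its predecessor
    List<Long> result = new ArrayList<>(coerced.length);
    for (int i = 0; i < coerced.length; i++) {
      if (i == 0 || coerced[i] != coerced[i - 1]) {
        result.add(coerced[i]);
      }
    }
    return result;
  }

  public static void main(String[] args)
  {
    // a mix of Integer and Long, as might arrive after JSON deserialization
    List<Number> mixed = Arrays.<Number>asList(21, 1L, 19, 1, 21L);
    System.out.println(coerceSortDedupe(mixed)); // [1, 19, 21]
  }
}
```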

gianm · 2024-03-20T04:35:16Z

+  }
+
+  private static DruidObjectPredicate<String> createStringPredicate(
+      final List sortedValues,


I usually prefer List<?> if it works ok, because it at least keeps the unknownness to the single type param rather than encouraging the dropping of all type params.

gianm · 2024-03-20T04:39:16Z

+import java.util.List;
+
+@RunWith(Enclosed.class)
+public class TypedInFilterTests


Is there a way to write a test class that subclasses InFilterTest so it automatically gets any new tests we add there? Maybe make inFilter a protected abstract method rather than a static. The copied tests here make it tough to properly maintain the test suite.

hmm, the new filter doesn't actually support default value mode though, so need to think of a way to handle that (as well as extractionFn tests). the other thing i could think of that might be a problem is that converting doubles to strings is sometimes ugly, so might still need to override some of those tests anyway depending on the mode

in the worst case i'll add javadoc comment to that test, and other classic filter tests that anything added there should be considered to also be added to the strongly typed replacement/equivalents tests

ok, consolidated these tests. its kind of ugly, but i think ok

abhishekagarwal87

My review is incomplete, but I am posting it anyway since I might not get to the rest of the PR.

abhishekagarwal87 · 2024-03-05T06:43:24Z

                <groupId>com.amazonaws</groupId>
                <artifactId>aws-java-sdk-bundle</artifactId>
                <version>${aws.sdk.version}</version>
+                <scope>provided</scope>


why this change?

comment is outdated, iirc this was an accident from experimenting with something else in the wrong branch

abhishekagarwal87 · 2024-03-07T10:54:09Z

  static final byte RANGE_CACHE_ID = 0x14;
  static final byte IS_FILTER_BOOLEAN_FILTER_CACHE_ID = 0x15;
  static final byte ARRAY_CONTAINS_CACHE_ID = 0x16;
+  static final byte NEW_IN_CACHE_ID = 0x17;


Suggested change

-  static final byte NEW_IN_CACHE_ID = 0x17;
+  static final byte TYPED_IN_CACHE_ID = 0x17;

abhishekagarwal87 · 2024-03-07T10:57:30Z

+   * Creates a new filter.
+   *
+   * @param column         column to search
+   * @param values         set of values to match. This collection may be reused to avoid copying a big collection.


Missing param for sortedValues.

abhishekagarwal87 · 2024-03-07T10:58:30Z

+      throw InvalidInput.exception("Invalid IN filter on column [%s], matchValueType cannot be null", column);
+    }
+    // one of sorted or not sorted
+    if (values == null && sortedValues == null) {


should you also check that both of them are not null either

yea, might as well with the way things are now
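A small sketch of the "exactly one of values or sortedValues" check agreed on here. It throws IllegalArgumentException rather than Druid's InvalidInput exceptions, and the class name, method name, and messages are hypothetical.

```java
import java.util.List;

// Hypothetical validation helper; the real constructor uses Druid's own exception types.
final class InFilterArgsSketch
{
  static void checkExactlyOne(String column, List<?> values, List<?> sortedValues)
  {
    if (values == null && sortedValues == null) {
      throw new IllegalArgumentException(
          String.format("Invalid IN filter on column [%s], values cannot be null", column)
      );
    }
    if (values != null && sortedValues != null) {
      throw new IllegalArgumentException(
          String.format("Invalid IN filter on column [%s], only one of values or sortedValues may be set", column)
      );
    }
  }
}
```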

abhishekagarwal87 · 2024-03-07T11:02:47Z

+    }
+    RangeSet<String> retSet = TreeRangeSet.create();
+    for (Object value : lazyMatchValues.get()) {
+      String valueEquivalent = NullHandling.nullToEmptyIfNeeded(Evals.asString(value));


since this filter only works in sql compatible mode, why wrap inside NullHandling.nullToEmptyIfNeeded?

stale code from when i was originally aspiring to support both modes, will remove

abhishekagarwal87 · 2024-03-07T11:05:38Z

+          column
+      );
+    }
+    if (sortedValues != null) {


what do you think of validating that sortedValues are indeed sorted?

im against it, it defeats the purpose of keeping them split, which is to not do extra work unless we need to. if people want to manually craft json that is broken then that's on them. if we document this filter i'll be sure to make it super clear that if this is set that it must be sorted exactly the same as the matchValueType, or maybe just not document that this property exists at all

added javadocs to hopefully better explain inner workings

probably best to document it, since it will show up in explain plan from SQL, and people will wonder what it means.

abhishekagarwal87 · 2024-03-07T11:09:00Z

+      }
+      return false;
+    }
+    return o1.size() == o2.size();


could be done before the loop to make the comparison faster.
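The suggestion amounts to comparing sizes first so that lists of different lengths are rejected without walking any elements. A minimal sketch, with hypothetical names and a caller-supplied comparator standing in for the match value type's comparator:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration of an element-wise list comparison with an early size check.
final class ListEqualitySketch
{
  static <T> boolean listsEqual(List<? extends T> o1, List<? extends T> o2, Comparator<? super T> comparator)
  {
    // cheap check first: different sizes can never be equal
    if (o1.size() != o2.size()) {
      return false;
    }
    Iterator<? extends T> it1 = o1.iterator();
    Iterator<? extends T> it2 = o2.iterator();
    while (it1.hasNext() && it2.hasNext()) {
      if (comparator.compare(it1.next(), it2.next()) != 0) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args)
  {
    Comparator<Long> longs = Comparator.<Long>naturalOrder();
    System.out.println(listsEqual(Arrays.asList(1L, 2L), Arrays.asList(1L, 2L), longs));     // true
    System.out.println(listsEqual(Arrays.asList(1L, 2L), Arrays.asList(1L, 2L, 3L), longs)); // false
  }
}
```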

abhishekagarwal87 · 2024-03-15T08:19:08Z

+import java.util.List;
+import java.util.NoSuchElementException;
+
+public interface ValueSetIndexes


the methods that callers use have javadocs which is consistent with all of these other interfaces.. but will maybe fill them in with some generic bit about getting BitmapColumnIndex for sets of values or .. something along with other interfaces

gianm · 2024-03-22T04:07:22Z

+      String valueEquivalent = Evals.asString(value);
      if (valueEquivalent == null) {
-        // Case when SQL compatible null handling is enabled
-        // Range.singleton(null) is invalid, so use the fact that


Most of this comment is still useful in explaining why we're doing lessThan("").

gianm · 2024-03-22T04:12:24Z

+
+  /**
+   * Supplier for list of values sorted by {@link #matchValueType}. This is lazily computed if
+   * {@link #unsortedValues} is not null and previously sorted.


would be useful to include a comment about whether this can contain duplicates or not, and if duplicates are not meant to be here, then also include a test that validates that deduplication happens on values (even if they are provided in sorted order).

gianm · 2024-03-22T04:13:52Z

+      };
+
+      // values are doubles and ordered in double order
+      if (matchValueType.is(ValueType.DOUBLE)) {


I am okay with what you decide here. I can see both ways making sense.

gianm

Approving since my remaining comments are about comments and optional refactors.

add new typed in filter

a3f7084

clintropolis added Performance Area - Querying labels Mar 5, 2024

github-actions Bot added Area - Segment Format and Ser/De Area - Dependencies labels Mar 5, 2024

clintropolis added 10 commits March 4, 2024 23:30

Merge remote-tracking branch 'upstream/master' into new-in-filter

40e2f83

adjust

da21223

only use in sql compatible mode

68feab4

Merge remote-tracking branch 'upstream/master' into new-in-filter

d5ac63b

Merge remote-tracking branch 'upstream/master' into new-in-filter

d327a0b

check for sortedness

f76ec09

Merge remote-tracking branch 'upstream/master' into new-in-filter

433ad78

fix java 8

d524306

dont explode benchmark on exception in setup

0e0f315

Merge remote-tracking branch 'upstream/master' into new-in-filter

d65568d

gianm reviewed Mar 20, 2024

abhishekagarwal87 reviewed Mar 20, 2024

clintropolis added 7 commits March 20, 2024 18:08

Merge remote-tracking branch 'upstream/master' into new-in-filter

9c31c44

tweaks and javadocs

ac7d4f2

fix test

1fdc549

unified in filter native test

7e280f1

cache id name

b6ce3d1

style

ce4d05b

javadoc

37ca67c

gianm reviewed Mar 22, 2024

gianm approved these changes Mar 22, 2024

clintropolis added 3 commits March 22, 2024 02:55

more comments, dedupe test

9527ada

faster sort

108c6cb

adjust

6d2299f

clintropolis merged commit b0a9c31 into apache:master Mar 22, 2024

clintropolis deleted the new-in-filter branch March 22, 2024 19:45

adarshsanjeev added this to the 30.0.0 milestone May 6, 2024

adarshsanjeev mentioned this pull request May 28, 2024

[DRAFT] 30.0.0 release notes #16505

Closed

clintropolis mentioned this pull request Apr 8, 2025

Query performance significantly degrades after upgrading from 22 to 27 #17891

Closed

clintropolis mentioned this pull request Jun 18, 2025

Web console: improve SQL autocomplete and add JSON autocomplete #18126

Merged
