add new typed in filter #16039

Merged
clintropolis merged 21 commits into apache:master from clintropolis:new-in-filter on Mar 22, 2024

Member

@clintropolis clintropolis commented Mar 5, 2024

Description

Adds a new TypedInFilter, similar to the work done for equality and range filters in #14542, as a replacement for InDimFilter. The new filter deals in native match value types instead of only supporting string sets. This gives a decent performance increase, particularly when matching numeric columns, both when using indexes and when using matchers, since we don't need to convert the values to do the comparison when the filter match value type matches the underlying column type.

The SQL planner uses the new TypedInFilter whenever sqlUseBoundAndSelectors is set to false, which is itself by default tied to druid.generic.useDefaultValueForNull=false (also the default). This filter does not support default value mode at all, so if druid.generic.useDefaultValueForNull=true then this filter will not be used even if sqlUseBoundAndSelectors is set to false.
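
As a rough illustration of the rule above, the decision boils down to two booleans. The class and method names here are hypothetical and are not taken from the planner code; only the two setting names come from this description.

```java
// Hypothetical sketch; the real decision lives in the SQL planner, not in a helper like this.
final class PlannerChoiceSketch
{
  /**
   * @param sqlUseBoundAndSelectors value of the sqlUseBoundAndSelectors setting
   * @param useDefaultValueForNull  value of druid.generic.useDefaultValueForNull
   * @return true if the planner would pick the typed IN filter under the rule described above
   */
  static boolean useTypedInFilter(boolean sqlUseBoundAndSelectors, boolean useDefaultValueForNull)
  {
    // Default value mode is never supported, so it vetoes the typed filter regardless of the other setting.
    return !sqlUseBoundAndSelectors && !useDefaultValueForNull;
  }
}
```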

The JSON creator for TypedInFilter:

  @JsonCreator
  public TypedInFilter(
      @JsonProperty("column") String column,
      @JsonProperty("matchValueType") ColumnType matchValueType,
      @JsonProperty("values") @Nullable List<?> values,
      @JsonProperty("sortedValues") @Nullable List<?> sortedValues,
      @JsonProperty("filterTuning") @Nullable FilterTuning filterTuning
  )

accepts either values, which may or may not be sorted (the constructor checks), or sortedValues, which is trusted to be sorted. Nearly all callers should just set values unless they are presorting the values; if the values are already sorted, the O(n) scan to check for sortedness should be relatively cheap and will automatically move them to sortedValues. The sorting, if needed, is done lazily, only when doing things like JSON serialization, computing the cache key, or checking if filters are equivalent. The idea is that the values can be sorted by the broker and are then guaranteed to be sorted when serialized and sent to the historicals, so that the historicals don't have to waste any effort sorting them.
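
To make the "check in the constructor, sort lazily" idea concrete, here is a minimal sketch. It is not the actual TypedInFilter code: the class name, the generic parameter, and the use of a plain Comparator are assumptions made for illustration, and unlike the real filter it does not coerce or dedupe values.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; names and structure are not from the PR.
final class LazySortedValuesSketch<T>
{
  private final List<T> values;
  private final Comparator<? super T> comparator;
  private List<T> sorted; // null until the sorted form is known

  LazySortedValuesSketch(List<T> values, Comparator<? super T> comparator)
  {
    this.values = values;
    this.comparator = comparator;
    if (isStrictlySorted(values, comparator)) {
      // Already sorted: reuse the caller's list, no copy and no sort.
      this.sorted = values;
    }
  }

  /** Single O(n) pass; equal neighbors are treated as "not sorted". */
  static <U> boolean isStrictlySorted(List<U> values, Comparator<? super U> comparator)
  {
    for (int i = 1; i < values.size(); i++) {
      if (comparator.compare(values.get(i - 1), values.get(i)) >= 0) {
        return false;
      }
    }
    return true;
  }

  /** Sorts on first use only (e.g. serialization or cache key computation), then caches the result. */
  synchronized List<T> getSorted()
  {
    if (sorted == null) {
      List<T> copy = new ArrayList<>(values);
      copy.sort(comparator);
      sorted = copy;
    }
    return sorted;
  }
}
```

With this shape, a caller that passes presorted values pays only for the linear scan, and a caller that never serializes the filter never pays for the sort at all.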

Unfortunately, the SQL planner does serialize these filters when explaining queries, which is the idea behind checking in the constructor whether the values are pre-sorted (and also the same type as the matchValueType parameter). Much of the SQL planning will actually have the values already sorted; the only cases that do not are things like join and lookup rewrites. Hopefully this will be a bit cheaper than building the sorted set that InDimFilter uses.

Speaking of sorted sets, this new filter uses plain Java lists instead of a sorted set, and then Collections.binarySearch to locate items within them. Since these value sets are immutable by the time we need to find things in them, the overhead of sorted sets doesn't really bring much to the table, so doing this binary search instead is slightly faster as well.
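
A small sketch of the lookup described above, assuming the list is already sorted by the same comparator used for the search. The class and method names are made up for the example; only Collections.binarySearch comes from the description.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of the PR.
final class SortedListLookupSketch
{
  /**
   * Membership test against a sorted, immutable list. Collections.binarySearch returns a
   * non-negative index on a hit and a negative insertion-point encoding on a miss.
   */
  static <T> boolean contains(List<? extends T> sortedValues, T value, Comparator<? super T> comparator)
  {
    return Collections.binarySearch(sortedValues, value, comparator) >= 0;
  }
}
```

For example, `contains(Arrays.asList(1L, 19L, 21L, 23L), 21L, Comparator.naturalOrder())` would report a match without building any tree structure up front.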

IN filters on numeric columns:

"SELECT long2 FROM foo WHERE long2 IN (1, 19, 21, 23, 25, 26, 46)",
before:
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       22           5000000  explicit              none        false  avgt    5  316.998 ±  2.726  ms/op
SqlNestedDataBenchmark.querySql       22           5000000  explicit              none        force  avgt    5  318.369 ±  3.914  ms/op
SqlNestedDataBenchmark.querySql       22           5000000      auto              none        false  avgt    5  171.867 ±  2.913  ms/op
SqlNestedDataBenchmark.querySql       22           5000000      auto              none        force  avgt    5  171.676 ±  2.216  ms/op
after:
SqlNestedDataBenchmark.querySql       22           5000000  explicit              none        false  avgt    5  265.012 ±  8.229  ms/op
SqlNestedDataBenchmark.querySql       22           5000000  explicit              none        force  avgt    5  273.842 ±  7.391  ms/op
SqlNestedDataBenchmark.querySql       22           5000000      auto              none        false  avgt    5  171.550 ±  1.633  ms/op
SqlNestedDataBenchmark.querySql       22           5000000      auto              none        force  avgt    5  173.540 ±  2.413  ms/op

"SELECT long2 FROM foo WHERE long2 IN (1, 19, 21, 23, 25, 26, 46) GROUP BY 1",
before:
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       24           5000000  explicit              none        false  avgt    5  365.012 ±  6.052  ms/op
SqlNestedDataBenchmark.querySql       24           5000000  explicit              none        force  avgt    5  250.669 ±  3.304  ms/op
SqlNestedDataBenchmark.querySql       24           5000000      auto              none        false  avgt    5  218.247 ±  1.382  ms/op
SqlNestedDataBenchmark.querySql       24           5000000      auto              none        force  avgt    5  162.742 ±  4.829  ms/op
after:
SqlNestedDataBenchmark.querySql       24           5000000  explicit              none        false  avgt    5  307.309 ± 10.928  ms/op
SqlNestedDataBenchmark.querySql       24           5000000  explicit              none        force  avgt    5  237.739 ±  8.574  ms/op
SqlNestedDataBenchmark.querySql       24           5000000      auto              none        false  avgt    5  213.082 ±  3.638  ms/op
SqlNestedDataBenchmark.querySql       24           5000000      auto              none        force  avgt    5  155.719 ±  3.434  ms/op

"SELECT long2 FROM foo WHERE long2 IN (1, 19, 21, 23, 25, 26, 46, 50, 51, 55, 60, 61, 66, 68, 69, 70, 77, 88, 90, 92, 93, 94, 95, 100, 101, 102, 104, 109, 111, 113, 114, 115, 120, 121, 122, 134, 135, 136, 140, 142, 150, 155, 170, 172, 173, 174, 180, 181, 190, 199, 200, 201, 202, 203, 204)",
before:
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       48           5000000  explicit              none        false  avgt    5  858.528 ±  6.713  ms/op
SqlNestedDataBenchmark.querySql       48           5000000  explicit              none        force  avgt    5  860.413 ± 10.305  ms/op
SqlNestedDataBenchmark.querySql       48           5000000      auto              none        false  avgt    5  146.786 ±  5.952  ms/op
SqlNestedDataBenchmark.querySql       48           5000000      auto              none        force  avgt    5  154.059 ±  8.155  ms/op
after:
SqlNestedDataBenchmark.querySql       48           5000000  explicit              none        false  avgt    5  260.688 ±  2.862  ms/op
SqlNestedDataBenchmark.querySql       48           5000000  explicit              none        force  avgt    5  256.196 ±  4.310  ms/op
SqlNestedDataBenchmark.querySql       48           5000000      auto              none        false  avgt    5  177.440 ±  1.757  ms/op
SqlNestedDataBenchmark.querySql       48           5000000      auto              none        force  avgt    5  147.445 ±  3.371  ms/op

"SELECT long2 FROM foo WHERE long2 IN (1, 19, 21, 23, 25, 26, 46, 50, 51, 55, 60, 61, 66, 68, 69, 70, 77, 88, 90, 92, 93, 94, 95, 100, 101, 102, 104, 109, 111, 113, 114, 115, 120, 121, 122, 134, 135, 136, 140, 142, 150, 155, 170, 172, 173, 174, 180, 181, 190, 199, 200, 201, 202, 203, 204) GROUP BY 1",
before:
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       50           5000000  explicit              none        false  avgt    5  947.775 ± 12.952  ms/op
SqlNestedDataBenchmark.querySql       50           5000000  explicit              none        force  avgt    5  377.426 ± 10.417  ms/op
SqlNestedDataBenchmark.querySql       50           5000000      auto              none        false  avgt    5  220.622 ±  6.089  ms/op
SqlNestedDataBenchmark.querySql       50           5000000      auto              none        force  avgt    5  163.623 ±  1.712  ms/op
after:
SqlNestedDataBenchmark.querySql       50           5000000  explicit              none        false  avgt    5  325.515 ±  7.563  ms/op
SqlNestedDataBenchmark.querySql       50           5000000  explicit              none        force  avgt    5  255.146 ±  4.047  ms/op
SqlNestedDataBenchmark.querySql       50           5000000      auto              none        false  avgt    5  216.967 ±  7.754  ms/op
SqlNestedDataBenchmark.querySql       50           5000000      auto              none        force  avgt    5  157.717 ±  4.723  ms/op

"SELECT long2 FROM foo WHERE double3 IN (1.0, 19.0, 21.0, 23.0, 25.0, 26.0, 46.0, 50.0, 51.0, 55.0, 60.0, 61.0, 66.0, 68.0, 69.0, 70.0, 77.0, 88.0, 90.0, 92.0, 93.0, 94.0, 95.0, 100.0, 101.0, 102.0, 104.0, 109.0, 111.0, 113.0, 114.0, 115.0, 120.0, 121.0, 122.0, 134.0, 135.0, 136.0, 140.0, 142.0, 150.0, 155.0, 170.0, 172.0, 173.0, 174.0, 180.0, 181.0, 190.0, 199.0, 200.0, 201.0, 202.0, 203.0, 204.0)",
before:
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       52           5000000  explicit              none        false  avgt    5  773.810 ±  7.200  ms/op
SqlNestedDataBenchmark.querySql       52           5000000  explicit              none        force  avgt    5  772.494 ±  3.224  ms/op
SqlNestedDataBenchmark.querySql       52           5000000      auto              none        false  avgt    5    2.169 ±  0.218  ms/op
SqlNestedDataBenchmark.querySql       52           5000000      auto              none        force  avgt    5    2.205 ±  0.217  ms/op
after:
SqlNestedDataBenchmark.querySql       52           5000000  explicit              none        false  avgt    5  108.259 ±  2.217  ms/op
SqlNestedDataBenchmark.querySql       52           5000000  explicit              none        force  avgt    5  108.179 ±  1.548  ms/op
SqlNestedDataBenchmark.querySql       52           5000000      auto              none        false  avgt    5    2.070 ±  0.144  ms/op
SqlNestedDataBenchmark.querySql       52           5000000      auto              none        force  avgt    5    2.019 ±  0.133  ms/op

"SELECT long2 FROM foo WHERE double3 IN (1.0, 19.0, 21.0, 23.0, 25.0, 26.0, 46.0, 50.0, 51.0, 55.0, 60.0, 61.0, 66.0, 68.0, 69.0, 70.0, 77.0, 88.0, 90.0, 92.0, 93.0, 94.0, 95.0, 100.0, 101.0, 102.0, 104.0, 109.0, 111.0, 113.0, 114.0, 115.0, 120.0, 121.0, 122.0, 134.0, 135.0, 136.0, 140.0, 142.0, 150.0, 155.0, 170.0, 172.0, 173.0, 174.0, 180.0, 181.0, 190.0, 199.0, 200.0, 201.0, 202.0, 203.0, 204.0) GROUP BY 1",
before:
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       54           5000000  explicit              none        false  avgt    5  904.097 ±  9.795  ms/op
SqlNestedDataBenchmark.querySql       54           5000000  explicit              none        force  avgt    5  242.745 ±  5.692  ms/op
SqlNestedDataBenchmark.querySql       54           5000000      auto              none        false  avgt    5  128.160 ±  2.043  ms/op
SqlNestedDataBenchmark.querySql       54           5000000      auto              none        force  avgt    5  129.161 ±  5.200  ms/op
after:
SqlNestedDataBenchmark.querySql       54           5000000  explicit              none        false  avgt    5  230.324 ± 12.410  ms/op
SqlNestedDataBenchmark.querySql       54           5000000  explicit              none        force  avgt    5  201.999 ±  4.346  ms/op
SqlNestedDataBenchmark.querySql       54           5000000      auto              none        false  avgt    5  125.256 ± 18.262  ms/op
SqlNestedDataBenchmark.querySql       54           5000000      auto              none        force  avgt    5  123.719 ±  8.906  ms/op

IN filters on string columns:

SELECT * FROM foo WHERE dimSequential IN ('1', '2', '3', '4', '5', '10', '11', '20', '21', '23', '40', '50', '64', '70', '100')

before:
Benchmark              (query)  (rowsPerSegment)  (schema)  (storageType)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlBenchmark.querySql       24           5000000  explicit           mmap              none        false  avgt    5  272.734 ± 18.392  ms/op
SqlBenchmark.querySql       24           5000000  explicit           mmap              none        force  avgt    5  272.081 ± 31.917  ms/op
SqlBenchmark.querySql       24           5000000      auto           mmap              none        false  avgt    5  232.986 ± 14.621  ms/op
SqlBenchmark.querySql       24           5000000      auto           mmap              none        force  avgt    5  229.473 ± 11.148  ms/op
after:
SqlBenchmark.querySql       24           5000000  explicit           mmap              none        false  avgt    5  260.132 ±  5.194  ms/op
SqlBenchmark.querySql       24           5000000  explicit           mmap              none        force  avgt    5  263.946 ±  4.753  ms/op
SqlBenchmark.querySql       24           5000000      auto           mmap              none        false  avgt    5  228.280 ± 12.907  ms/op
SqlBenchmark.querySql       24           5000000      auto           mmap              none        force  avgt    5  228.460 ± 10.062  ms/op


SELECT dimSequential, dimZipf, SUM(sumLongSequential) FROM foo WHERE dimSequential IN ('1', '2', '3', '4', '5', '10', '11', '20', '21', '23', '40', '50', '64', '70', '100') GROUP BY 1, 2

before:
Benchmark              (query)  (rowsPerSegment)  (schema)  (storageType)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlBenchmark.querySql       26           5000000  explicit           mmap              none        false  avgt    5   26.105 ±  2.186  ms/op
SqlBenchmark.querySql       26           5000000  explicit           mmap              none        force  avgt    5   22.348 ±  1.196  ms/op
SqlBenchmark.querySql       26           5000000      auto           mmap              none        false  avgt    5   25.959 ±  0.923  ms/op
SqlBenchmark.querySql       26           5000000      auto           mmap              none        force  avgt    5   22.445 ±  1.647  ms/op
after:
SqlBenchmark.querySql       26           5000000  explicit           mmap              none        false  avgt    5   25.170 ±  0.654  ms/op
SqlBenchmark.querySql       26           5000000  explicit           mmap              none        force  avgt    5   21.514 ±  1.403  ms/op
SqlBenchmark.querySql       26           5000000      auto           mmap              none        false  avgt    5   25.258 ±  0.501  ms/op
SqlBenchmark.querySql       26           5000000      auto           mmap              none        force  avgt    5   21.564 ±  0.824  ms/op

The string results are pretty close since most of the internals are identical, though they still show a slight improvement from using the sorted list instead of a sorted set.

Release note

TBD


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

changes:
* adds TypedInFilter which preserves matching sets in the native match value type
* SQL planner uses new TypedInFilter when druid.generic.useDefaultValueForNull=false (the default)
return null;
} else if (arrayElements.length == 1) {
if (plannerContext.isUseBoundsAndSelectors()) {
if (plannerContext.isUseBoundsAndSelectors() || (!simpleExtractionExpr.isDirectColumnAccess() && virtualColumnRegistry == null)) {
Contributor

when is virtualColumnRegistry null?

& if it can be null, then I think CodeQL makes a good point: could it be null in the else branch as well, where there is no null guard?

Member Author

it looks like it can be null in some join cases, https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidJoinQueryRel.java#L497 calls it with null. will add a guard

Member Author

oh wait, the else branch here only uses the virtual column registry if it's not a direct column access, so the registry cannot be null or else it would have made a selector filter with an extractionFn

Contributor

@gianm gianm Mar 21, 2024

👍 in that case a comment would be useful? it seemed to confuse both us and CodeQL

Member Author

ok, reworked this a bit, if the virtual column registry is null it seems equivalent to just use an expression filter rather than forcing a selector filter + extractionFn, so I did that for now, since it seems better to do that and get away from reliance on using extractionFn defined on filters

@Nullable
BitmapColumnIndex forSortedValues(@Nonnull List<?> sortedValues, TypeSignature<ValueType> matchValueType);

static <T> BitmapColumnIndex getIndexFromSortedIteratorSortedMerged(
Contributor

Javadoc for this and getIndexFromSortedIterator? For some reason, the names don't make it obvious to me what they do. (One is regular and one is SortedMerged?)

Reading the code, it looks like this one does a linear scan through the dictionary, and the other one does a series of binary searches. Maybe call them getIndexFromSortedIteratorWithScan and getIndexFromSortedIteratorWithBinarySearch.

Oh, now I see why it's called SortedMerged. The zipping is kind of like the merge step of a merge-sort. Still, I feel the "scan" and "binary search" names would make more sense. And Javadoc would definitely help.

Member Author

there are 3 total methods: the sorted merge (getIndexFromSortedIteratorSortedMerged), the binary search with short-circuit when the values are sorted the same way as the dictionary and we get past the end of the dictionary (getIndexFromSortedIterator), and the binary search when the values are not necessarily sorted the same way as the dictionary and so we must iterate them all (getIndexFromIterator).

Will add javadocs and try to give better names

Member Author

renamed stuff and added javadocs to explain the relation between these three functions and when to use them
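
For readers following this thread, the three strategies can be sketched roughly as below. This is a simplified illustration, not the code in the PR: it assumes a sorted dictionary of unique strings held in a List and returns matching dictionary positions instead of a BitmapColumnIndex, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the three lookup strategies discussed above.
final class DictionaryLookupSketch
{
  /** Sorted merge: walk both sorted lists in lockstep, like the merge step of a merge sort. */
  static List<Integer> sortedMerge(List<String> dictionary, List<String> sortedValues)
  {
    List<Integer> ids = new ArrayList<>();
    int d = 0;
    int v = 0;
    while (d < dictionary.size() && v < sortedValues.size()) {
      int cmp = dictionary.get(d).compareTo(sortedValues.get(v));
      if (cmp == 0) {
        ids.add(d);
        d++;
        v++;
      } else if (cmp < 0) {
        d++; // dictionary entry is smaller, advance the dictionary
      } else {
        v++; // value is not in the dictionary, advance the values
      }
    }
    return ids;
  }

  /** Binary search per value, stopping once a value sorts past the end of the dictionary. */
  static List<Integer> binarySearchSorted(List<String> dictionary, List<String> sortedValues)
  {
    List<Integer> ids = new ArrayList<>();
    for (String value : sortedValues) {
      int pos = Collections.binarySearch(dictionary, value);
      if (pos >= 0) {
        ids.add(pos);
      } else if (-(pos + 1) >= dictionary.size()) {
        break; // every later value is larger still, so none can match
      }
    }
    return ids;
  }

  /** Binary search per value with no ordering assumption, so every value must be checked. */
  static List<Integer> binarySearchUnsorted(List<String> dictionary, List<String> values)
  {
    List<Integer> ids = new ArrayList<>();
    for (String value : values) {
      int pos = Collections.binarySearch(dictionary, value);
      if (pos >= 0) {
        ids.add(pos);
      }
    }
    return ids;
  }
}
```

The merge variant costs roughly one pass over both lists, so it tends to pay off when the value set is large relative to the dictionary, while the binary search variants suit small value sets against a big dictionary.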

};
}

static <T> BitmapColumnIndex getIndexFromIterator(
Contributor

Javadoc would be useful for this one too.

};

// values are doubles and ordered in double order
if (matchValueType.is(ValueType.DOUBLE)) {
Contributor

@gianm gianm Mar 20, 2024

Is it possible to minimize the copied code between here and the LONG version, such as by using a shared helper somewhere?

Member Author

@clintropolis clintropolis Mar 21, 2024

yea probably, also for the other index implementations.. i can look into it since has been on my todo list for a while

Member Author

i guess the downside to doing this is that i was hoping at some point to specialize the value dictionaries for primitive values (e.g. #12846) and combining them would make that not very easy since would need a generic parameter instead of java primitive.

Member Author

i have not done this change yet, still thinking about it

Contributor

I am okay with what you decide here. I can see both ways making sense.

import java.util.Set;
import java.util.SortedSet;

public class TypedInFilter extends AbstractOptimizableDimFilter implements Filter
Contributor

Some Javadoc for this and InDimFilter that point to each other would be good. People new to this area of the code base will need to be informed that both exist.

Member Author

added, also added recommendations to InDimFilter, SelectorDimFilter, and BoundDimFilter to use TypedInFilter, EqualityFilter, NullFilter, and RangeFilter instead

}
if (matchValueType.is(ValueType.STRING)) {
this.lazyMatchValueBytes = Suppliers.memoize(() -> {
final SortedSet<ByteBuffer> matchValueBytes = new ObjectAVLTreeSet<>(ByteBufferUtils.utf8Comparator());
Contributor

Couldn't this be a List rather than a tree set? The values are already in the right order thanks to lazyMatchValues.get().

Member Author

yea, i would need to change the interface of Utf8ValueSetIndexes to not require a sorted set, but can go ahead and do that

Member Author

changed Utf8ValueSetIndexes to use a List now instead of a SortedSet, which required some changes in InDimFilter too, but i think that is ok since the value set there also ensures it's ordered when it is created

Object coerced = coerceValue(array[i], matchValueType);
//noinspection ObjectEquality
if (coerced != null && array[i] == coerced) {
// assume list is all same type objects...
Contributor

Is this going to cause issues when reading JSON that has values set rather than sortedValues? (Where they might not all be the same type)

Also, do we care? I suppose if we aren't documenting this, then there's no way we'd get JSON with values in it. The serializer always writes sortedValues. You know, though, if this is the thinking, then we should really also split the constructor into two: a @JsonCreator that takes sortedValues only (not values) and a non-Jackson-enabled constructor that accepts possibly-unsorted values. No sense in supporting values in the JSON if we aren't going to use it.

Member Author

hmm, yea I guess i could always coerce while sorting just to be safe. I was trying to short-circuit if it looked like it wasn't going to be necessary, and was just assuming people wouldn't troll the set with mixed type junk like strings and numbers over json (a mix of, say, ints and longs wouldn't really be a problem because the column type comparators handle the values as numbers rather than exact types to guard against json serde shenanigans)

i guess i was still on the fence of whether or not we document. we did document the other new native typed filters, so was sort of assuming we would this one too, especially since there is a performance improvement

Member Author

added coercion while sorting just in case

this.unsortedValues = null;
this.lazyMatchValues = () -> sortedValues;
} else {
if (checkSorted(values, matchValueType)) {
Contributor

Do we need to check for duplicates too, or are duplicates ok in lazyMatchValues? Similar question for sortValues: does it need to dedupe or is just sorting ok?

Member Author

oops, checkSorted actually does mark it as not sorted if there are dupe values, but sortValues method used to dedupe and now it isn't, i should probably fix that, was lost when i switched to using ObjectArrays.quickSort.

that said, dupes wouldn't really break anything, just inefficient

Member Author

i went back to using a sorted set to order and dedupe since I couldn't think of a very efficient way to do this otherwise, other than creating my own modified version of quickSort that also dedupes

Contributor

Hmm, dedupe after quicksort is O(n); probably wouldn't notice it much next to the quicksort. It might even be faster to do quicksort + dedupe pass vs. doing tree-sort. Although, this probably doesn't matter that much, since it's only happening once per query. The sorted set approach is IMO OK for this patch, esp. since that's the same thing the in filter is doing.

Member Author

yea, i toyed with it a bit but didn't spend any time measuring anything yet. I think the main thing that made me switch was the addition of the value coercion: with it we iterate the list once to make all the values be the matchValueType, then sort, then iterate again to dedupe, so it seemed like maybe i should just use the set, coerce on insert, and then turn that into a list.

Now i'm a bit curious though, so will do some measurements.

Member Author

ok, so got around to measuring it and went back to quicksort, even with coercion and dedupe is still better on bigger sets

set:
Benchmark                           (filterSize)  Mode  Cnt     Score     Error  Units
InFilterBenchmark.sortFilterValues             1  avgt    3     0.021 ±   0.006  us/op
InFilterBenchmark.sortFilterValues            10  avgt    3     0.323 ±   0.035  us/op
InFilterBenchmark.sortFilterValues           100  avgt    3     4.521 ±   0.483  us/op
InFilterBenchmark.sortFilterValues          1000  avgt    3   104.984 ±  11.389  us/op
InFilterBenchmark.sortFilterValues         10000  avgt    3  1906.647 ± 922.912  us/op

quicksort:
Benchmark                           (filterSize)  Mode  Cnt     Score     Error  Units
InFilterBenchmark.sortFilterValues             1  avgt    3     0.021 ±   0.005  us/op
InFilterBenchmark.sortFilterValues            10  avgt    3     0.574 ±   0.083  us/op
InFilterBenchmark.sortFilterValues           100  avgt    3     4.678 ±   1.100  us/op
InFilterBenchmark.sortFilterValues          1000  avgt    3    84.306 ±  10.615  us/op
InFilterBenchmark.sortFilterValues         10000  avgt    3  1555.324 ± 228.895  us/op

(I didn't add this benchmark to the PR)
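
For reference, the coerce, sort, then dedupe sequence discussed in this thread looks roughly like the sketch below. It is only an illustration for long values: Arrays.sort stands in for the ObjectArrays.quickSort mentioned earlier, the names are hypothetical, and null handling is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the sortValues implementation from the PR.
final class SortDedupeSketch
{
  /** Coerces JSON-style numbers (Integer, Long, ...) to long, sorts them, then drops adjacent duplicates. */
  static List<Long> coerceSortDedupe(List<? extends Number> values)
  {
    long[] coerced = new long[values.size()];
    for (int i = 0; i < coerced.length; i++) {
      coerced[i] = values.get(i).longValue(); // pass 1: coerce to the match value type
    }
    Arrays.sort(coerced); // pass 2: sort (stand-in for the quicksort used in the PR)
    List<Long> result = new ArrayList<>(coerced.length);
    for (int i = 0; i < coerced.length; i++) { // pass 3: O(n) dedupe, duplicates are now neighbors
      if (i == 0 || coerced[i] != coerced[i - 1]) {
        result.add(coerced[i]);
      }
    }
    return result;
  }
}
```

The extra linear passes are cheap next to the O(n log n) sort, which is consistent with the quicksort numbers above pulling ahead on the larger filter sizes.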

}

private static DruidObjectPredicate<String> createStringPredicate(
final List sortedValues,
Contributor

I usually prefer List<?> if it works ok, because it at least keeps the unknownness to the single type param rather than encouraging the dropping of all type params.

Member Author

done

import java.util.List;

@RunWith(Enclosed.class)
public class TypedInFilterTests
Contributor

Is there a way to write a test class that subclasses InFilterTest so it automatically gets any new tests we add there? Maybe make inFilter a protected abstract method rather than a static. The copied tests here make it tough to properly maintain the test suite.

Member Author

hmm, the new filter doesn't actually support default value mode though, so need to think of a way to handle that (as well as extractionFn tests). the other thing i could think of that might be a problem is that converting doubles to strings is sometimes ugly, so might still need to override some of those tests anyway depending on the mode

in the worst case i'll add javadoc comment to that test, and other classic filter tests that anything added there should be considered to also be added to the strongly typed replacement/equivalents tests

Member Author

ok, consolidated these tests. it's kind of ugly, but i think ok

Contributor

@abhishekagarwal87 abhishekagarwal87 left a comment

My review is incomplete, but I'm posting it anyway since I might not get to the rest of the PR.

<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-bundle</artifactId>
<version>${aws.sdk.version}</version>
<scope>provided</scope>
Contributor

why this change?

Member Author

comment is outdated, iirc this was an accident from experimenting with something else in the wrong branch

static final byte RANGE_CACHE_ID = 0x14;
static final byte IS_FILTER_BOOLEAN_FILTER_CACHE_ID = 0x15;
static final byte ARRAY_CONTAINS_CACHE_ID = 0x16;
static final byte NEW_IN_CACHE_ID = 0x17;
Contributor

Suggested change
- static final byte NEW_IN_CACHE_ID = 0x17;
+ static final byte TYPED_IN_CACHE_ID = 0x17;

Member Author

done

* Creates a new filter.
*
* @param column column to search
* @param values set of values to match. This collection may be reused to avoid copying a big collection.
Contributor

Missing param for sortedValues.

Member Author

added

throw InvalidInput.exception("Invalid IN filter on column [%s], matchValueType cannot be null", column);
}
// one of sorted or not sorted
if (values == null && sortedValues == null) {
Contributor

should you also check that they are not both non-null?

Member Author

yea, might as well with the way things are now

Member Author

added
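
For illustration, an "exactly one of the two" check along the lines discussed in this thread might look like the following. This is a minimal sketch: `InValuesCheck` and `requireExactlyOne` are hypothetical names, and `IllegalArgumentException` stands in for the `InvalidInput.exception(...)` call that the constructor in this PR uses.

```java
import java.util.List;

// Hypothetical helper; the real check lives in the TypedInFilter constructor.
final class InValuesCheck
{
  private InValuesCheck() {}

  // Returns whichever list was supplied, rejecting "neither" and "both".
  static List<?> requireExactlyOne(String column, List<?> values, List<?> sortedValues)
  {
    if (values == null && sortedValues == null) {
      throw new IllegalArgumentException(
          String.format("Invalid IN filter on column [%s], values cannot be null", column)
      );
    }
    if (values != null && sortedValues != null) {
      throw new IllegalArgumentException(
          String.format("Invalid IN filter on column [%s], only one of values or sortedValues may be set", column)
      );
    }
    return values != null ? values : sortedValues;
  }
}
```

The error message wording here is an assumption, not the text used in the merged code.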

}
RangeSet<String> retSet = TreeRangeSet.create();
for (Object value : lazyMatchValues.get()) {
String valueEquivalent = NullHandling.nullToEmptyIfNeeded(Evals.asString(value));
Contributor

since this filter only works in sql compatible mode, why wrap inside NullHandling.nullToEmptyIfNeeded?

Member Author

stale code from when i was originally aspiring to support both modes, will remove

Member Author

removed

column
);
}
if (sortedValues != null) {
Contributor

what do you think of validating that sortedValues are indeed sorted?

Member Author

i'm against it, it defeats the purpose of keeping them split, which is to not do extra work unless we need to. if people want to manually craft json that is broken then that's on them. if we document this filter i'll be sure to make it super clear that if this property is set, the values must be sorted in exactly the order defined by the matchValueType, or maybe just not document that this property exists at all

Member Author

added javadocs to hopefully better explain inner workings

Contributor

probably best to document it, since it will show up in explain plan from SQL, and people will wonder what it means.
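
For reference, the O(n) sortedness scan that the PR description says is applied to `values` could look roughly like this. It is a sketch, not the PR's code: `SortednessCheck` and `isSorted` are hypothetical names, and treating duplicates as "still sorted" is an assumption. It also shows why the ordering has to follow the match value type: the same values can be in order as strings but out of order as longs.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper; a single pass that stops at the first out-of-order pair.
final class SortednessCheck
{
  private SortednessCheck() {}

  static <T> boolean isSorted(List<T> values, Comparator<? super T> comparator)
  {
    T previous = null;
    boolean first = true;
    for (T value : values) {
      // Non-strict ordering: equal neighbors are accepted (an assumption of this sketch).
      if (!first && comparator.compare(previous, value) > 0) {
        return false;
      }
      previous = value;
      first = false;
    }
    return true;
  }

  public static void main(String[] args)
  {
    Comparator<String> stringOrder = Comparator.nullsFirst(Comparator.<String>naturalOrder());
    Comparator<Long> longOrder = Comparator.nullsFirst(Comparator.<Long>naturalOrder());

    // "10" sorts before "9" lexicographically, so this list is sorted as strings...
    System.out.println(isSorted(Arrays.asList("10", "9"), stringOrder));
    // ...but the same values are not sorted when compared as longs.
    System.out.println(isSorted(Arrays.asList(10L, 9L), longOrder));
  }
}
```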

}
return false;
}
return o1.size() == o2.size();
Contributor

could be done before the loop to make the comparison faster.

Member Author

done
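
To illustrate the reordering suggested here: comparing sizes first lets lists of different lengths be rejected without comparing any elements. The method under review is only partially visible in this thread, so the following is an assumed shape, with `SortedListEquality` and `sortedEquals` as hypothetical names.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper; element-wise equality of two lists under a comparator.
final class SortedListEquality
{
  private SortedListEquality() {}

  static <T> boolean sortedEquals(List<T> o1, List<T> o2, Comparator<? super T> comparator)
  {
    // Cheap check first: different sizes can never be equal.
    if (o1.size() != o2.size()) {
      return false;
    }
    Iterator<T> it1 = o1.iterator();
    Iterator<T> it2 = o2.iterator();
    // Sizes match, so both iterators are exhausted together.
    while (it1.hasNext()) {
      if (comparator.compare(it1.next(), it2.next()) != 0) {
        return false;
      }
    }
    return true;
  }
}
```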

import java.util.List;
import java.util.NoSuchElementException;

public interface ValueSetIndexes
Contributor

javadocs

Member Author

the methods that callers use have javadocs, which is consistent with all of these other interfaces.. but i'll maybe fill in the class-level docs, here and on the other interfaces, with some generic bit about getting a BitmapColumnIndex for sets of values or something

String valueEquivalent = Evals.asString(value);
if (valueEquivalent == null) {
// Case when SQL compatible null handling is enabled
// Range.singleton(null) is invalid, so use the fact that
Contributor

Most of this comment is still useful in explaining why we're doing lessThan("").


/**
* Supplier for list of values sorted by {@link #matchValueType}. This is lazily computed if
* {@link #unsortedValues} is not null and previously sorted.
Contributor

would be useful to include a comment about whether this can contain duplicates or not, and if duplicates are not meant to be here, then also include a test that validates that deduplication happens on values (even if they are provided in sorted order).
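
A sketch of what the requested deduplication check could exercise. Whether the merged filter deduplicates is not stated in this thread, so this only shows one way to sort and deduplicate under a type-specific comparator; `SortedDedup` and `sortAndDedupe` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper; a TreeSet keeps one element per comparator-equal group.
final class SortedDedup
{
  private SortedDedup() {}

  static <T> List<T> sortAndDedupe(Collection<T> values, Comparator<? super T> comparator)
  {
    // The comparator must accept null if the input may contain null.
    TreeSet<T> set = new TreeSet<>(comparator);
    set.addAll(values);
    return new ArrayList<>(set);
  }
}
```

A test in the spirit of the comment would pass an already-sorted input that contains duplicates, for example `1, 1, 2` with a nulls-first natural-order comparator, and assert that the result is `1, 2`.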

};

// values are doubles and ordered in double order
if (matchValueType.is(ValueType.DOUBLE)) {
Contributor

I am okay with what you decide here. I can see both ways making sense.

Contributor

@gianm left a comment

Approving since my remaining comments are about comments and optional refactors.

@clintropolis clintropolis merged commit b0a9c31 into apache:master Mar 22, 2024