Lazy-ify ValueMatcher BitSet optimization for string dimensions. #5403
gianm merged 4 commits into apache:master
Conversation
The idea is that if the prior evaluated filters are decently selective, such that they mean we won't see all possible values of the later filters, then the eager version of the optimization is too wasteful. This involves checking an extra bitset, but the overhead is small even if the lazy-ification is useless.

For reference see the "readWithExFnPostFilter" benchmark (which does not benefit from this optimization) and "readAndFilter" (which does).

Related: #5402, which will be more useful if the non-indexed columns can be filtered on with this lazy matching optimization.
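To make the mechanism concrete, here is a minimal, self-contained sketch of the lazy caching logic (reconstructed from the diff fragments quoted in the review below; the class name and `java.util.function.Predicate` are stand-ins, since the real code lives in Druid's dictionary-encoded column classes and uses Guava's `Predicate`):

```java
import java.util.BitSet;
import java.util.function.Predicate;

// Minimal, self-contained model of the lazy optimization. Names here are
// simplified stand-ins for Druid's dictionary-encoded column / ValueMatcher code.
final class LazyPredicateCache
{
  private final Predicate<String> predicate;
  private final BitSet checkedIds;   // ids whose predicate result is already cached
  private final BitSet matchingIds;  // ids whose cached result is "matched"

  LazyPredicateCache(Predicate<String> predicate, int cardinality)
  {
    this.predicate = predicate;
    this.checkedIds = new BitSet(cardinality);
    this.matchingIds = new BitSet(cardinality);
  }

  // Called once per row: evaluates the predicate only the first time a
  // dictionary id is seen, then answers from the bitsets.
  boolean matches(int id, String value)
  {
    if (checkedIds.get(id)) {
      return matchingIds.get(id);
    }
    final boolean matches = predicate.test(value);
    checkedIds.set(id);
    if (matches) {
      matchingIds.set(id);
    }
    return matches;
  }
}
```

The eager version instead evaluates the predicate for every dictionary id up front; the lazy version only pays for ids that actually reach the matcher, at the cost of consulting one extra bitset per row.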
```java
  .makeColumnValueSelector("sumLongSequential");
while (!input.isDone()) {
  long rowval = selector.getLong();
  blackhole.consume(rowval);
```
I suggest using a real aggregator, like LongSum, to make the load more realistic.
Do you think it matters as long as the load is consistent between tests? I'd rather leave it using blackhole (especially since the string one can't switch to an aggregator anyway)
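For reference, a hypothetical version of that loop with a LongSum-style accumulation instead of per-row blackhole consumption might look like this (a sketch only; `input.advance()` is assumed from the surrounding cursor loop, which the diff fragment truncates):

```java
// Sketch: accumulate like a LongSum aggregator would, consuming only the
// final sum. The PR kept per-row blackhole consumption instead, for
// consistency with the string benchmark.
long sum = 0;
while (!input.isDone()) {
  sum += selector.getLong();
  input.advance();
}
blackhole.consume(sum);
```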
```java
while (!input.isDone()) {
  for (DimensionSelector selector : selectors) {
    IndexedInts row = selector.getRow();
    blackhole.consume(selector.lookupName(row.get(0)));
```
Same suggestion here: use a real aggregator to make the load more realistic.
The input is a string, so aggregators cannot run on it.
Well, I guess "cardinality" could, but that seems like a bad load to test with since it's pretty heavy.
```java
)
{
  final BitSet predicateMatchingValueIds = makePredicateMatchingSet(selector, predicate);
  final BitSet checkedIds = new BitSet(selector.getValueCardinality());
```
Better to use one bitset (with two-bit "blocks") than two separate bitsets, for locality. Also, in the future it could be replaced with a specialized class that fetches and writes two bits at a time, in order to do fewer ops in the hot loop.
I tried this transformation and it ended up slower. Timings after the change are below (compare to the numbers from the PR description). I'm not sure why. Is it possible that the JVM is so smart as to only compute things like bitIndex (inside the BitSets) once, in the case where there are two different BitSets and you get the same index from both in immediate succession?
```
Benchmark                                        (rowsPerSegment)  (schema)  Mode  Cnt      Score     Error  Units
FilterPartitionBenchmark.readWithExFnPostFilter            750000     basic  avgt   25  10690.447 ± 222.691  us/op
FilterPartitionBenchmark.readAndFilter                     750000     basic  avgt   25    305.222 ±   5.011  us/op
```
The transformed, slower code was:

```java
@Override
public ValueMatcher makeValueMatcher(final Predicate<String> predicate)
{
  final BitSet matchingIds = new BitSet(getCardinality() * 2);

  // Lazy matcher; only check an id if matches() is called.
  return new ValueMatcher()
  {
    @Override
    public boolean matches()
    {
      final int id = getRowValue();
      final int checkBit = id * 2;
      final int matchBit = checkBit + 1;
      if (matchingIds.get(checkBit)) {
        return matchingIds.get(matchBit);
      } else {
        final boolean matches = predicate.apply(lookupName(id));
        matchingIds.set(checkBit);
        if (matches) {
          matchingIds.set(matchBit);
        }
        return matches;
      }
    }

    @Override
    public void inspectRuntimeShape(RuntimeShapeInspector inspector)
    {
      inspector.visit("column", SimpleDictionaryEncodedColumn.this);
    }
  };
}
```
Based on this experiment, and the other, simpler one below (#5403 (comment)), I guess the JVM is that smart. So the original code is fastest, because it doesn't have as many branches and can also reuse computations internal to BitSet.get.
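For context on why the JIT could reuse work across the two `get` calls: `java.util.BitSet.get(int)` in OpenJDK is essentially the following (paraphrased; bounds check and assertions omitted, `words` is the backing `long[]`):

```java
// Paraphrase of java.util.BitSet.get(int) from OpenJDK:
boolean get(int bitIndex)
{
  int wordIndex = bitIndex >> 6;  // which 64-bit word holds the bit
  return (wordIndex < wordsInUse)
      && ((words[wordIndex] & (1L << bitIndex)) != 0);
}
```

With two separate bitsets queried back to back for the same `id`, `bitIndex >> 6` and `1L << bitIndex` are identical for both calls and can be computed once; the merged two-bit layout uses two different bit indexes (`checkBit` and `matchBit`), so that sharing is lost, which would support the theory above.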
> a specialized class that fetches and writes two bits at a time, in order to do fewer ops in the hot loop

This would probably be fastest of all, but since the current code is not too bad in terms of overhead even in the worst case, I think that optimization is not needed for this patch. I suspect that in the real world, it's more common to hit the good case (the lazy optimization is useful and saves time) than the bad case (the lazy optimization is useless, so we just hope it doesn't add too much overhead).
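For concreteness, a hypothetical version of that specialized class might look like the sketch below (not part of this patch; the name `TwoBitSet` and the bit layout are invented for illustration). It packs each id's "checked" and "matched" bits into adjacent bits of one `long`, so both can be read with a single word access:

```java
// Hypothetical sketch of a two-bit-per-id set. Bit layout per id:
// bit 0 = "checked", bit 1 = "matched". Since each pair starts at an even
// offset, a pair never straddles a word boundary.
final class TwoBitSet
{
  private final long[] words;

  TwoBitSet(int cardinality)
  {
    this.words = new long[((cardinality * 2) + 63) >>> 6];
  }

  /** Returns 0 if unchecked, 0b01 if checked-but-unmatched, 0b11 if checked-and-matched. */
  int getPair(int id)
  {
    final int bit = id << 1;
    return (int) (words[bit >>> 6] >>> (bit & 63)) & 0b11;
  }

  /** Marks id as checked, and as matched if matches is true, in one word write. */
  void setPair(int id, boolean matches)
  {
    final int bit = id << 1;
    words[bit >>> 6] |= (matches ? 0b11L : 0b01L) << (bit & 63);
  }
}
```

A matcher built on this would call `getPair(id)` once per row and branch on the two bits, instead of issuing up to two `BitSet.get` calls.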
> Is it possible that the JVM is so smart as to only compute things like bitIndex (inside the BitSets) once, in the case where there are two different BitSets and you get the same index from both in immediate succession?

Could you verify this theory by looking at the assembly with `-prof perfasm` (`dtraceasm` on Mac)? It's important to understand clearly why some counterintuitive results are observed.
Hi @leventov, do you have any tips on how to do this from within IntelliJ on a Mac? I tried adding an annotation `@Fork(value = 1, jvmArgs = "-prof dtraceasm")` but I got a bunch of errors like the one below. I'm hoping this is something you have run into in the past and may know how to fix.
```
Exception in thread "main" java.lang.NoClassDefFoundError: java/lang/invoke/LambdaForm$MH
at com.sun.demo.jvmti.hprof.Tracker.nativeCallSite(Native Method)
at com.sun.demo.jvmti.hprof.Tracker.CallSite(Tracker.java:99)
at java.lang.invoke.InvokerBytecodeGenerator.emitNewArray(InvokerBytecodeGenerator.java:889)
at java.lang.invoke.InvokerBytecodeGenerator.generateCustomizedCodeBytes(InvokerBytecodeGenerator.java:688)
at java.lang.invoke.InvokerBytecodeGenerator.generateCustomizedCode(InvokerBytecodeGenerator.java:618)
at java.lang.invoke.LambdaForm.compileToBytecode(LambdaForm.java:654)
at java.lang.invoke.LambdaForm.prepare(LambdaForm.java:635)
at java.lang.invoke.MethodHandle.<init>(MethodHandle.java:461)
at java.lang.invoke.BoundMethodHandle.<init>(BoundMethodHandle.java:58)
at java.lang.invoke.BoundMethodHandle$Species_L.<init>(BoundMethodHandle.java:211)
at java.lang.invoke.BoundMethodHandle$Species_L.copyWith(BoundMethodHandle.java:228)
at java.lang.invoke.MethodHandle.asCollector(MethodHandle.java:1002)
at java.lang.invoke.MethodHandleImpl$AsVarargsCollector.<init>(MethodHandleImpl.java:460)
at java.lang.invoke.MethodHandleImpl$AsVarargsCollector.<init>(MethodHandleImpl.java:454)
at java.lang.invoke.MethodHandleImpl.makeVarargsCollector(MethodHandleImpl.java:445)
at java.lang.invoke.MethodHandle.setVarargs(MethodHandle.java:1325)
at java.lang.invoke.MethodHandles$Lookup.getDirectMethodCommon(MethodHandles.java:1665)
at java.lang.invoke.MethodHandles$Lookup.getDirectMethod(MethodHandles.java:1600)
at java.lang.invoke.MethodHandles$Lookup.findStatic(MethodHandles.java:779)
at java.lang.invoke.MethodHandleImpl$Lazy.<clinit>(MethodHandleImpl.java:627)
at java.lang.invoke.MethodHandleImpl.varargsArray(MethodHandleImpl.java:1506)
at java.lang.invoke.MethodHandleImpl.varargsArray(MethodHandleImpl.java:1623)
at java.lang.invoke.MethodHandle.asCollector(MethodHandle.java:999)
at java.lang.invoke.MethodHandleImpl$AsVarargsCollector.<init>(MethodHandleImpl.java:460)
at java.lang.invoke.MethodHandleImpl$AsVarargsCollector.<init>(MethodHandleImpl.java:454)
at java.lang.invoke.MethodHandleImpl.makeVarargsCollector(MethodHandleImpl.java:445)
at java.lang.invoke.MethodHandle.setVarargs(MethodHandle.java:1325)
at java.lang.invoke.MethodHandles$Lookup.getDirectMethodCommon(MethodHandles.java:1665)
at java.lang.invoke.MethodHandles$Lookup.getDirectMethod(MethodHandles.java:1600)
at java.lang.invoke.MethodHandles$Lookup.findStatic(MethodHandles.java:779)
at java.lang.invoke.CallSite.<clinit>(CallSite.java:226)
at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(MethodHandleNatives.java:307)
at java.lang.invoke.MethodHandleNatives.linkCallSite(MethodHandleNatives.java:297)
at java.io.ObjectInputStream.<clinit>(ObjectInputStream.java:3578)
at org.openjdk.jmh.runner.link.BinaryLinkClient.<init>(BinaryLinkClient.java:74)
at org.openjdk.jmh.runner.ForkedMain.main(ForkedMain.java:72)
HPROF ERROR: Unexpected Exception found afterward [hprof_util.c:494]
HPROF TERMINATED PROCESS
```
I always run JMH with profilers from the command line (and have never heard of anybody doing anything different). Running benchmarks from an IDE is not a good idea anyway (might it have affected the results that you obtained previously?).

`-prof perfasm` is not a JVM arg, it's an arg of the JMH Main. Try `java -jar target/benchmarks -prof list` first.

If you are on a Mac, you should use `dtraceasm`. You should also disable "System Integrity Protection" in order to be able to do this: http://osxdaily.com/2015/10/05/disable-rootless-system-integrity-protection-mac-os-x/
I use a JMH IDEA plugin: https://github.com/artyushov/idea-jmh-plugin. Although I had never read the readme until now, and apparently it can affect the results slightly (they say they observed differences of up to a couple of percent).

What do you do to run the Druid JMH benchmarks? Do you know offhand what needs to be done to build them before they can be run? It'd be nice to add something to the Druid developer docs (https://github.com/druid-io/druid/blob/master/CONTRIBUTING.md) teaching people like me what to do.
```java
    this,
    predicate
);
final BitSet checkedIds = new BitSet(getCardinality());
```
```java
} else {
  matches = predicate.apply(selector.lookupName(id));
  checkedIds.set(id);
  matchingIds.set(id, matches);
```
`if (matches) { matchingIds.set(id); }` is better.
This transformation gives worse performance in the benchmark:
```
FilterPartitionBenchmark.readWithExFnPostFilter  750000  basic  avgt  25  10587.515 ± 789.090  us/op
```
I suspect it is due to adding a branch where none previously existed.
It's impossible: if you look at the source code of `set(index, boolean)`, it has the same branching, plus it does extra work in the case when `matches` is false, which we can avoid. Something is wrong with your measurements.
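For reference, `java.util.BitSet.set(int, boolean)` in OpenJDK is just:

```java
// java.util.BitSet (OpenJDK): the two-arg set() branches internally anyway.
public void set(int bitIndex, boolean value) {
    if (value)
        set(bitIndex);
    else
        clear(bitIndex);  // extra work on the false path that the caller can skip
}
```

So `matchingIds.set(id, matches)` pays for the same branch plus a `clear(id)` call when `matches` is false, which `if (matches) { matchingIds.set(id); }` avoids entirely.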
Well, you're right. Even though I tried it twice yesterday (since I thought it surprising) and got this result, this morning I tried again and got something that was in line with the original result (around 8800 µs/op). So I'll make this minor transformation, because why not. The other, bigger transformation (collapsing to one bitset) still measures as slower though.
Pushed the minor transformation.
```java
} else {
  final boolean matches = predicate.apply(lookupName(id));
  checkedIds.set(id);
  matchingIds.set(id, matches);
```
This is on the order of a 5% slowdown for the bad case. Can you expand more on when the good vs. bad cases would be encountered?
Sure. ValueMatchers are for so-called "post filtering", which means filtering during a row scan, rather than using the index. Imagine a filter like `col1 = 'a' && col2 = 'b'`: the good case is where the col1 filter prunes enough rows that the col2 matcher never sees many of the possible col2 values, so their predicates are never evaluated. The bad case is where this does not happen: where there is no pruning of col2 values based on a col1 filter. In this case, laziness doesn't save any time.

Some more context on this change: today Druid uses indexes very aggressively, so string post filtering doesn't happen very often. I think it may actually never happen on QueryableIndex, since between QueryableIndexStorageAdapter (which uses bitmaps for any top-level filters that support them) and OrFilter/AndFilter (which have makeMatcher impls that use bitmaps when possible for subfilters), it looks like this code may never get called. So today, I think this change really just affects IncrementalIndex filtering.

The motivation behind this change is to open up future optimizations like #5402, which will be more useful if non-indexed columns can be filtered with this lazy matching optimization.
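To illustrate the good case, here is a small self-contained sketch (the `RowMatcher` interface and `AndRowMatcher` class are simplified stand-ins, not the actual Druid classes) of why AND short-circuiting means a later matcher only sees rows that survived the earlier ones:

```java
// Illustrative sketch: an AND of matchers evaluates children in order and
// stops at the first false, so the col2 matcher is only consulted for rows
// that already passed the col1 matcher. If col1 is selective, the lazy col2
// matcher caches predicate results for far fewer distinct dictionary ids.
interface RowMatcher
{
  boolean matches();
}

final class AndRowMatcher implements RowMatcher
{
  private final RowMatcher[] children;

  AndRowMatcher(RowMatcher... children)
  {
    this.children = children;
  }

  @Override
  public boolean matches()
  {
    for (RowMatcher child : children) {
      if (!child.matches()) {
        return false;  // short-circuit: later children never run for this row
      }
    }
    return true;
  }
}
```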
@gianm could we make the matcher eager then, if there is just one filter?

It would make sense, although it would be a more substantial refactoring of a lot of layers of interfaces (makeMatcher has to know which mode to run in). I would rather not do it in this PR, especially since I don't think it will have an effect on QueryableIndex based queries right now. And even if we do explore using ValueMatchers more, it probably won't be too important then either, since we'd aim to only use them in the good case (if there's just one filter we'd use bitmaps, unless the user disabled bitmaps for that column).

Might be worth creating an issue then.

👍
Sure, I think that makes sense and will do it if this patch is merged as-is. I haven't had a chance to figure out how to do the verification you suggested in #5403 (comment), and I was hoping you have some tips about how to do that.
I thought about this a little more and realized there is a case where this change will affect QueryableIndex based queries: filtered aggregators call makeMatcher unconditionally, so they will always use this matcher code. I believe it's still worth it though, since in the benchmarks we see a 5% slowdown in the bad case and a 50x speedup in an ideal case.

To me, the patch could be merged (once some issues are created). @drcrallen, what do you think?
Lazy-ify ValueMatcher BitSet optimization for string dimensions. (apache#5403)

* Lazy-ify ValueMatcher BitSet optimization for string dimensions.

The idea is that if the prior evaluated filters are decently selective, such that they mean we won't see all possible values of the later filters, then the eager version of the optimization is too wasteful. This involves checking an extra bitset, but the overhead is small even if the lazy-ification is useless.

* Remove import.

* Minor transformation
* Lazy-ify IncrementalIndex filtering too.

Follow-up to #5403, which only lazy-ified cursor-based filtering on QueryableIndex.

* Fix logic error.