Lazy-ify ValueMatcher BitSet optimization for string dimensions. #5403
gianm merged 4 commits into apache:master
Conversation
The idea is that if the prior evaluated filters are decently selective, such that they mean we won't see all possible values of the later filters, then the eager version of the optimization is too wasteful. This involves checking an extra bitset, but the overhead is small even if the lazy-ification is useless.

For reference see the "readWithExFnPostFilter" benchmark (which does not benefit from this optimization) and "readAndFilter" (which does).

Related: #5402, which will be more useful if the non-indexed columns can be filtered on with this lazy matching optimization.
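To make the mechanism concrete, here is a minimal, self-contained sketch of the lazy caching logic (reconstructed from the diff fragments quoted in the review below; the class name and `java.util.function.Predicate` are stand-ins, since the real code lives in Druid's dictionary-encoded column classes and uses Guava's `Predicate`):

```java
import java.util.BitSet;
import java.util.function.Predicate;

// Minimal, self-contained model of the lazy optimization. Names here are
// simplified stand-ins for Druid's dictionary-encoded column / ValueMatcher code.
final class LazyPredicateCache
{
  private final Predicate<String> predicate;
  private final BitSet checkedIds;   // ids whose predicate result is already cached
  private final BitSet matchingIds;  // ids whose cached result is "matched"

  LazyPredicateCache(Predicate<String> predicate, int cardinality)
  {
    this.predicate = predicate;
    this.checkedIds = new BitSet(cardinality);
    this.matchingIds = new BitSet(cardinality);
  }

  // Called once per row: evaluates the predicate only the first time a
  // dictionary id is seen, then answers from the bitsets.
  boolean matches(int id, String value)
  {
    if (checkedIds.get(id)) {
      return matchingIds.get(id);
    }
    final boolean matches = predicate.test(value);
    checkedIds.set(id);
    if (matches) {
      matchingIds.set(id);
    }
    return matches;
  }
}
```

The eager version instead evaluates the predicate for every dictionary id up front; the lazy version only pays for ids that actually reach the matcher, at the cost of consulting one extra bitset per row.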
```java
  .makeColumnValueSelector("sumLongSequential");
while (!input.isDone()) {
  long rowval = selector.getLong();
  blackhole.consume(rowval);
```
I suggest using a real aggregator, like LongSum, to make the load more realistic.
Do you think it matters as long as the load is consistent between tests? I'd rather leave it using blackhole (especially since the string one can't switch to an aggregator anyway)
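For reference, a hypothetical version of that loop with a LongSum-style accumulation instead of per-row blackhole consumption might look like this (a sketch only; `input.advance()` is assumed from the surrounding cursor loop, which the diff fragment truncates):

```java
// Sketch: accumulate like a LongSum aggregator would, consuming only the
// final sum. The PR kept per-row blackhole consumption instead, for
// consistency with the string benchmark.
long sum = 0;
while (!input.isDone()) {
  sum += selector.getLong();
  input.advance();
}
blackhole.consume(sum);
```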
```java
while (!input.isDone()) {
  for (DimensionSelector selector : selectors) {
    IndexedInts row = selector.getRow();
    blackhole.consume(selector.lookupName(row.get(0)));
```
Same suggestion here: use a real aggregator to make the load more realistic.
The input is a string, so aggregators cannot run on it.
Well, I guess "cardinality" could, but that seems like a bad load to test with since it's pretty heavy.
```java
)
{
  final BitSet predicateMatchingValueIds = makePredicateMatchingSet(selector, predicate);
  final BitSet checkedIds = new BitSet(selector.getValueCardinality());
```
Better to use one bitset (with two-bit "blocks") than two separate bitsets, for locality. Also, in the future it could be replaced with a specialized class that fetches and writes two bits at a time, in order to do fewer ops in the hot loop.
I tried this transformation and it ended up slower. Timings after the change are below (compare to the numbers from the PR description). I'm not sure why. Is it possible that the JVM is so smart as to only compute things like bitIndex (inside the BitSets) once, in the case where there are two different BitSets and you get the same index from both in immediate succession?
```
Benchmark                                        (rowsPerSegment)  (schema)  Mode  Cnt      Score     Error  Units
FilterPartitionBenchmark.readWithExFnPostFilter            750000     basic  avgt   25  10690.447 ± 222.691  us/op
FilterPartitionBenchmark.readAndFilter                     750000     basic  avgt   25    305.222 ±   5.011  us/op
```
The transformed, slower code was:

```java
@Override
public ValueMatcher makeValueMatcher(final Predicate<String> predicate)
{
  final BitSet matchingIds = new BitSet(getCardinality() * 2);

  // Lazy matcher; only check an id if matches() is called.
  return new ValueMatcher()
  {
    @Override
    public boolean matches()
    {
      final int id = getRowValue();
      final int checkBit = id * 2;
      final int matchBit = checkBit + 1;
      if (matchingIds.get(checkBit)) {
        return matchingIds.get(matchBit);
      } else {
        final boolean matches = predicate.apply(lookupName(id));
        matchingIds.set(checkBit);
        if (matches) {
          matchingIds.set(matchBit);
        }
        return matches;
      }
    }

    @Override
    public void inspectRuntimeShape(RuntimeShapeInspector inspector)
    {
      inspector.visit("column", SimpleDictionaryEncodedColumn.this);
    }
  };
}
```
Based on this experiment, and the other, simpler one below (#5403 (comment)), I guess the JVM is that smart. So the original code is fastest, because it doesn't have as many branches and can also reuse computations internal to BitSet.get.
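For context on why the JIT could reuse work across the two `get` calls: `java.util.BitSet.get(int)` in OpenJDK is essentially the following (paraphrased; bounds check and assertions omitted, `words` is the backing `long[]`):

```java
// Paraphrase of java.util.BitSet.get(int) from OpenJDK:
boolean get(int bitIndex)
{
  int wordIndex = bitIndex >> 6;  // which 64-bit word holds the bit
  return (wordIndex < wordsInUse)
      && ((words[wordIndex] & (1L << bitIndex)) != 0);
}
```

With two separate bitsets queried back to back for the same `id`, `bitIndex >> 6` and `1L << bitIndex` are identical for both calls and can be computed once; the merged two-bit layout uses two different bit indexes (`checkBit` and `matchBit`), so that sharing is lost, which would support the theory above.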
> a specialized class that fetches and writes two bits at a time, in order to do fewer ops in the hot loop

This would probably be fastest of all, but since the current code is not too bad in terms of overhead even in the worst case, I think that optimization is not needed for this patch. I suspect that in the real world, it's more common to hit the good case (the lazy optimization is useful and saves time) than the bad case (the lazy optimization is useless, so we just hope it doesn't add too much overhead).
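For concreteness, a hypothetical version of that specialized class might look like the sketch below (not part of this patch; the name `TwoBitSet` and the bit layout are invented for illustration). It packs each id's "checked" and "matched" bits into adjacent bits of one `long`, so both can be read with a single word access:

```java
// Hypothetical sketch of a two-bit-per-id set. Bit layout per id:
// bit 0 = "checked", bit 1 = "matched". Since each pair starts at an even
// offset, a pair never straddles a word boundary.
final class TwoBitSet
{
  private final long[] words;

  TwoBitSet(int cardinality)
  {
    this.words = new long[((cardinality * 2) + 63) >>> 6];
  }

  /** Returns 0 if unchecked, 0b01 if checked-but-unmatched, 0b11 if checked-and-matched. */
  int getPair(int id)
  {
    final int bit = id << 1;
    return (int) (words[bit >>> 6] >>> (bit & 63)) & 0b11;
  }

  /** Marks id as checked, and as matched if matches is true, in one word write. */
  void setPair(int id, boolean matches)
  {
    final int bit = id << 1;
    words[bit >>> 6] |= (matches ? 0b11L : 0b01L) << (bit & 63);
  }
}
```

A matcher built on this would call `getPair(id)` once per row and branch on the two bits, instead of issuing up to two `BitSet.get` calls.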
> Is it possible that the JVM is so smart as to only compute things like bitIndex (inside the BitSets) once, in the case where there are two different BitSets and you get the same index from both in immediate succession?

Could you verify this theory by looking at the assembly with `-prof perfasm` (`dtraceasm` on Mac)? It's important to understand clearly why some counterintuitive results are observed.
Hi @leventov, do you have any tips on how to do this from within IntelliJ on a Mac? I tried adding an annotation `@Fork(value = 1, jvmArgs = "-prof dtraceasm")` but I got a bunch of errors like the one below. I'm hoping this is something you have run into in the past and may know how to fix.
```
Exception in thread "main" java.lang.NoClassDefFoundError: java/lang/invoke/LambdaForm$MH
at com.sun.demo.jvmti.hprof.Tracker.nativeCallSite(Native Method)
at com.sun.demo.jvmti.hprof.Tracker.CallSite(Tracker.java:99)
at java.lang.invoke.InvokerBytecodeGenerator.emitNewArray(InvokerBytecodeGenerator.java:889)
at java.lang.invoke.InvokerBytecodeGenerator.generateCustomizedCodeBytes(InvokerBytecodeGenerator.java:688)
at java.lang.invoke.InvokerBytecodeGenerator.generateCustomizedCode(InvokerBytecodeGenerator.java:618)
at java.lang.invoke.LambdaForm.compileToBytecode(LambdaForm.java:654)
at java.lang.invoke.LambdaForm.prepare(LambdaForm.java:635)
at java.lang.invoke.MethodHandle.<init>(MethodHandle.java:461)
at java.lang.invoke.BoundMethodHandle.<init>(BoundMethodHandle.java:58)
at java.lang.invoke.BoundMethodHandle$Species_L.<init>(BoundMethodHandle.java:211)
at java.lang.invoke.BoundMethodHandle$Species_L.copyWith(BoundMethodHandle.java:228)
at java.lang.invoke.MethodHandle.asCollector(MethodHandle.java:1002)
at java.lang.invoke.MethodHandleImpl$AsVarargsCollector.<init>(MethodHandleImpl.java:460)
at java.lang.invoke.MethodHandleImpl$AsVarargsCollector.<init>(MethodHandleImpl.java:454)
at java.lang.invoke.MethodHandleImpl.makeVarargsCollector(MethodHandleImpl.java:445)
at java.lang.invoke.MethodHandle.setVarargs(MethodHandle.java:1325)
at java.lang.invoke.MethodHandles$Lookup.getDirectMethodCommon(MethodHandles.java:1665)
at java.lang.invoke.MethodHandles$Lookup.getDirectMethod(MethodHandles.java:1600)
at java.lang.invoke.MethodHandles$Lookup.findStatic(MethodHandles.java:779)
at java.lang.invoke.MethodHandleImpl$Lazy.<clinit>(MethodHandleImpl.java:627)
at java.lang.invoke.MethodHandleImpl.varargsArray(MethodHandleImpl.java:1506)
at java.lang.invoke.MethodHandleImpl.varargsArray(MethodHandleImpl.java:1623)
at java.lang.invoke.MethodHandle.asCollector(MethodHandle.java:999)
at java.lang.invoke.MethodHandleImpl$AsVarargsCollector.<init>(MethodHandleImpl.java:460)
at java.lang.invoke.MethodHandleImpl$AsVarargsCollector.<init>(MethodHandleImpl.java:454)
at java.lang.invoke.MethodHandleImpl.makeVarargsCollector(MethodHandleImpl.java:445)
at java.lang.invoke.MethodHandle.setVarargs(MethodHandle.java:1325)
at java.lang.invoke.MethodHandles$Lookup.getDirectMethodCommon(MethodHandles.java:1665)
at java.lang.invoke.MethodHandles$Lookup.getDirectMethod(MethodHandles.java:1600)
at java.lang.invoke.MethodHandles$Lookup.findStatic(MethodHandles.java:779)
at java.lang.invoke.CallSite.<clinit>(CallSite.java:226)
at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(MethodHandleNatives.java:307)
at java.lang.invoke.MethodHandleNatives.linkCallSite(MethodHandleNatives.java:297)
at java.io.ObjectInputStream.<clinit>(ObjectInputStream.java:3578)
at org.openjdk.jmh.runner.link.BinaryLinkClient.<init>(BinaryLinkClient.java:74)
at org.openjdk.jmh.runner.ForkedMain.main(ForkedMain.java:72)
HPROF ERROR: Unexpected Exception found afterward [hprof_util.c:494]
HPROF TERMINATED PROCESS
```
I always run JMH with profilers from the command line (and have never heard of anybody doing anything different). Running benchmarks from an IDE is not a good idea anyway (might it have affected the results that you obtained previously?).

`-prof perfasm` is not a JVM arg, it's an arg of the JMH Main. Try `java -jar target/benchmarks -prof list` first.

If you are on a Mac, you should use `dtraceasm`. You should also disable "System Integrity Protection" in order to be able to do this: http://osxdaily.com/2015/10/05/disable-rootless-system-integrity-protection-mac-os-x/
I use a JMH IDEA plugin: https://github.com/artyushov/idea-jmh-plugin. Although I had never read the readme until now, and apparently it can affect the results slightly (they say they observed differences of up to a couple of percent).

What do you do to run the Druid JMH benchmarks? Do you know offhand what needs to be done to build them before they can be run? It'd be nice to add something to the Druid developer docs (https://github.com/druid-io/druid/blob/master/CONTRIBUTING.md) teaching people like me what to do.
```java
    this,
    predicate
);
final BitSet checkedIds = new BitSet(getCardinality());
```
```java
} else {
  matches = predicate.apply(selector.lookupName(id));
  checkedIds.set(id);
  matchingIds.set(id, matches);
```
`if (matches) { matchingIds.set(id); }` is better.
This transformation gives worse performance in the benchmark:
```
FilterPartitionBenchmark.readWithExFnPostFilter  750000  basic  avgt  25  10587.515 ± 789.090  us/op
```
I suspect it is due to adding a branch where none previously existed.
It's impossible: if you look at the source code of `set(index, boolean)`, it has the same branching, plus it does extra work in the case when `matches` is false, which we can avoid. Something is wrong with your measurements.
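For reference, `java.util.BitSet.set(int, boolean)` in OpenJDK is just:

```java
// java.util.BitSet (OpenJDK): the two-arg set() branches internally anyway.
public void set(int bitIndex, boolean value) {
    if (value)
        set(bitIndex);
    else
        clear(bitIndex);  // extra work on the false path that the caller can skip
}
```

So `matchingIds.set(id, matches)` pays for the same branch plus a `clear(id)` call when `matches` is false, which `if (matches) { matchingIds.set(id); }` avoids entirely.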
Well, you're right. Even though I tried it twice yesterday (since I thought it surprising) and got this result, this morning I tried again and got something that was in line with the original result (around 8800 µs/op). So I'll make this minor transformation, because why not. The other, bigger transformation (collapsing to one bitset) still measures as slower though.
Pushed the minor transformation.
```java
} else {
  final boolean matches = predicate.apply(lookupName(id));
  checkedIds.set(id);
  matchingIds.set(id, matches);
```
This is on the order of a 5% slowdown for the bad case. Can you expand more on when the good vs. bad cases would be encountered?
Sure. ValueMatchers are for so-called "post filtering", which means filtering during a row scan, rather than using the index. Imagine a filter like `col1 = 'a' && col2 = 'b'`: the good case is where the col1 filter prunes enough rows that the col2 matcher never sees many of the possible col2 values, so their predicates are never evaluated. The bad case is where this does not happen: where there is no pruning of col2 values based on a col1 filter. In this case, laziness doesn't save any time.

Some more context on this change: today Druid uses indexes very aggressively, so string post filtering doesn't happen very often. I think it may actually never happen on QueryableIndex, since between QueryableIndexStorageAdapter (which uses bitmaps for any top-level filters that support them) and OrFilter/AndFilter (which have makeMatcher impls that use bitmaps when possible for subfilters), it looks like this code may never get called. So today, I think this change really just affects IncrementalIndex filtering.

The motivation behind this change is to open up future optimizations like #5402, which will be more useful if non-indexed columns can be filtered with this lazy matching optimization.
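To illustrate the good case, here is a small self-contained sketch (the `RowMatcher` interface and `AndRowMatcher` class are simplified stand-ins, not the actual Druid classes) of why AND short-circuiting means a later matcher only sees rows that survived the earlier ones:

```java
// Illustrative sketch: an AND of matchers evaluates children in order and
// stops at the first false, so the col2 matcher is only consulted for rows
// that already passed the col1 matcher. If col1 is selective, the lazy col2
// matcher caches predicate results for far fewer distinct dictionary ids.
interface RowMatcher
{
  boolean matches();
}

final class AndRowMatcher implements RowMatcher
{
  private final RowMatcher[] children;

  AndRowMatcher(RowMatcher... children)
  {
    this.children = children;
  }

  @Override
  public boolean matches()
  {
    for (RowMatcher child : children) {
      if (!child.matches()) {
        return false;  // short-circuit: later children never run for this row
      }
    }
    return true;
  }
}
```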
@gianm could we make the matcher eager then, if there is just one filter?

It would make sense, although it would be a more substantial refactoring of a lot of layers of interfaces (makeMatcher has to know which mode to run in). I would rather not do it in this PR, especially since I don't think it will have an effect on QueryableIndex based queries right now. And even if we do explore using ValueMatchers more, it probably won't be too important then either, since we'd aim to only use them in the good case (if there's just one filter we'd use bitmaps, unless the user disabled bitmaps for that column).

Might be worth creating an issue then.

👍
Sure, I think that makes sense and will do it if this patch is merged as-is. I haven't had a chance to figure out how to do the verification you suggested in #5403 (comment), and I was hoping you have some tips about how to do that.
I thought about this a little more and realized there is a case where this change will affect QueryableIndex based queries: filtered aggregators call makeMatcher unconditionally, so they will always use this matcher code. I believe it's still worth it though, since in the benchmarks we see a 5% slowdown in the bad case and a 50x speedup in an ideal case.

To me, the patch could be merged (once some issues are created). @drcrallen, what do you think?
Lazy-ify ValueMatcher BitSet optimization for string dimensions. (apache#5403)

* Lazy-ify ValueMatcher BitSet optimization for string dimensions.

The idea is that if the prior evaluated filters are decently selective, such that they mean we won't see all possible values of the later filters, then the eager version of the optimization is too wasteful. This involves checking an extra bitset, but the overhead is small even if the lazy-ification is useless.

* Remove import.

* Minor transformation
* Lazy-ify IncrementalIndex filtering too.

Follow-up to #5403, which only lazy-ified cursor-based filtering on QueryableIndex.

* Fix logic error.