
Lazy-ify ValueMatcher BitSet optimization for string dimensions. #5403

Merged
gianm merged 4 commits into apache:master from gianm:feature-lazy-valueMatching
Jun 5, 2018

Conversation

@gianm (Contributor) commented Feb 20, 2018

The idea is that if the prior evaluated filters are decently selective, such
that we won't see all possible values of the later filters, then the eager
version of the optimization is too wasteful.

This involves checking an extra bitset, but the overhead is small even
if the lazy-ification is useless.

For reference see the "readWithExFnPostFilter" benchmark (which does
not benefit from this optimization) and "readAndFilter" (which does).

Related: #5402, which will be more useful if the non-indexed columns
can be filtered on with this lazy matching optimization.

master

Benchmark                                         (rowsPerSegment)  (schema)  Mode  Cnt     Score      Error  Units
FilterPartitionBenchmark.readWithExFnPostFilter             750000     basic  avgt   25  8499.008  ± 192.483  us/op
FilterPartitionBenchmark.readAndFilter                      750000     basic  avgt   25  16649.892 ± 374.661  us/op

patch

Benchmark                                         (rowsPerSegment)  (schema)  Mode  Cnt     Score      Error  Units
FilterPartitionBenchmark.readWithExFnPostFilter             750000     basic  avgt   25  8822.299  ± 193.224  us/op
FilterPartitionBenchmark.readAndFilter                      750000     basic  avgt   25  308.379   ±   9.603  us/op
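As an illustration of the scheme described above (a minimal, hypothetical sketch, not the actual Druid code; the class and field names are made up), the lazy matcher checks a "checked" bit first and only evaluates the predicate on a cache miss:

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

// Hypothetical, simplified sketch of the lazy matcher in this patch.
// "checkedIds" records which dictionary ids have been evaluated so far;
// "matchingIds" caches the predicate result for those ids.
class LazyMatcherSketch
{
  private final BitSet checkedIds;
  private final BitSet matchingIds;
  private final IntPredicate predicate; // stands in for predicate.apply(lookupName(id))
  int predicateCalls; // instrumentation for this demo only

  LazyMatcherSketch(final int cardinality, final IntPredicate predicate)
  {
    this.checkedIds = new BitSet(cardinality);
    this.matchingIds = new BitSet(cardinality);
    this.predicate = predicate;
  }

  // Analogous to ValueMatcher.matches() for the current row's dictionary id.
  public boolean matches(final int id)
  {
    if (checkedIds.get(id)) {
      // This id was already evaluated once; reuse the cached answer.
      return matchingIds.get(id);
    }
    predicateCalls++;
    final boolean matches = predicate.test(id);
    checkedIds.set(id);
    if (matches) {
      matchingIds.set(id);
    }
    return matches;
  }
}
```

Unlike the eager version, which evaluates the predicate for every dictionary id up front, this pays only for ids the cursor actually reaches; the extra cost in the bad case is one additional BitSet.get per row.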

    .makeColumnValueSelector("sumLongSequential");
    while (!input.isDone()) {
      long rowval = selector.getLong();
      blackhole.consume(rowval);
Member:

I suggest using a real aggregator, like LongSum, to make the load more realistic

@gianm (author):

Do you think it matters as long as the load is consistent between tests? I'd rather leave it using Blackhole (especially since the string one can't switch to an aggregator anyway).

    while (!input.isDone()) {
      for (DimensionSelector selector : selectors) {
        IndexedInts row = selector.getRow();
        blackhole.consume(selector.lookupName(row.get(0)));
Member:

Suggested using a real aggregator to make the load more realistic

@gianm (author):

The input is a string so aggregators cannot run on it.

@gianm (author):

Well, I guess "cardinality" could, but that seems like a bad load to test with since it's pretty heavy.

    )
    {
      final BitSet predicateMatchingValueIds = makePredicateMatchingSet(selector, predicate);
      final BitSet checkedIds = new BitSet(selector.getValueCardinality());
Member:

Better to use one bitset (with two-bit "blocks") than two separate bitsets, for locality. Also, in the future it could be replaced with a specialized class that fetches and writes two bits at a time, in order to do fewer ops in the hot loop.

@gianm (author) commented Feb 21, 2018:

I tried this transformation and it ended up slower. Timings after the change are below (compare to the numbers from the PR description). I'm not sure why. Is it possible that the JVM is so smart as to only compute things like bitIndex (inside the BitSets) once, in the case where there are two different BitSets and you get the same index from both in immediate succession?

Benchmark                                        (rowsPerSegment)  (schema)  Mode  Cnt    Score       Error  Units
FilterPartitionBenchmark.readWithExFnPostFilter            750000     basic  avgt   25  10690.447 ± 222.691  us/op
FilterPartitionBenchmark.readAndFilter                     750000     basic  avgt   25  305.222   ±   5.011  us/op

The transformed, slower code was:

        @Override
        public ValueMatcher makeValueMatcher(final Predicate<String> predicate)
        {
          final BitSet matchingIds = new BitSet(getCardinality() * 2);

          // Lazy matcher; only check an id if matches() is called.
          return new ValueMatcher()
          {
            @Override
            public boolean matches()
            {
              final int id = getRowValue();
              final int checkBit = id * 2;
              final int matchBit = checkBit + 1;

              if (matchingIds.get(checkBit)) {
                return matchingIds.get(matchBit);
              } else {
                final boolean matches = predicate.apply(lookupName(id));
                matchingIds.set(checkBit);
                if (matches) {
                  matchingIds.set(matchBit);
                }
                return matches;
              }
            }

            @Override
            public void inspectRuntimeShape(RuntimeShapeInspector inspector)
            {
              inspector.visit("column", SimpleDictionaryEncodedColumn.this);
            }
          };
        }

@gianm (author) commented Feb 21, 2018:

Based on this experiment, and the other, simpler one below (#5403 (comment)), I guess the JVM is that smart. So the original code is fastest, because it doesn't have as many branches and can also reuse computations internal to BitSet.get.

@gianm (author):

> specialized class that fetches and writes two bits at a time, in order to make less ops in the hot loop.

This would probably be fastest of all, but since the current code is not too bad in terms of overhead even in the worst case, I think the optimization is not needed for this patch. I suspect that in the real world, it's more common to get the good case (the lazy optimization is useful and saves time) than the bad case (the lazy optimization is useless, so we just hope it doesn't add too much overhead).

@leventov (Member) commented Feb 21, 2018:

> Is it possible that the JVM is so smart as to only compute things like bitIndex (inside the BitSets) once, in the case where there are two different BitSets and you get the same index from both in immediate succession?

Could you verify this theory by looking at the assembly with -prof perfasm (dtraceasm on Mac)?

It's important to understand clearly why some counter intuitive results are observed.

@gianm (author):

Hi @leventov do you have some tips on how to do this from within intellij on a Mac? I tried adding an annotation @Fork(value = 1, jvmArgs = "-prof dtraceasm") but I got a bunch of errors like this. I am hoping this is something you have run into in the past and may know how to fix.

Exception in thread "main" java.lang.NoClassDefFoundError: java/lang/invoke/LambdaForm$MH
	at com.sun.demo.jvmti.hprof.Tracker.nativeCallSite(Native Method)
	at com.sun.demo.jvmti.hprof.Tracker.CallSite(Tracker.java:99)
	at java.lang.invoke.InvokerBytecodeGenerator.emitNewArray(InvokerBytecodeGenerator.java:889)
	at java.lang.invoke.InvokerBytecodeGenerator.generateCustomizedCodeBytes(InvokerBytecodeGenerator.java:688)
	at java.lang.invoke.InvokerBytecodeGenerator.generateCustomizedCode(InvokerBytecodeGenerator.java:618)
	at java.lang.invoke.LambdaForm.compileToBytecode(LambdaForm.java:654)
	at java.lang.invoke.LambdaForm.prepare(LambdaForm.java:635)
	at java.lang.invoke.MethodHandle.<init>(MethodHandle.java:461)
	at java.lang.invoke.BoundMethodHandle.<init>(BoundMethodHandle.java:58)
	at java.lang.invoke.BoundMethodHandle$Species_L.<init>(BoundMethodHandle.java:211)
	at java.lang.invoke.BoundMethodHandle$Species_L.copyWith(BoundMethodHandle.java:228)
	at java.lang.invoke.MethodHandle.asCollector(MethodHandle.java:1002)
	at java.lang.invoke.MethodHandleImpl$AsVarargsCollector.<init>(MethodHandleImpl.java:460)
	at java.lang.invoke.MethodHandleImpl$AsVarargsCollector.<init>(MethodHandleImpl.java:454)
	at java.lang.invoke.MethodHandleImpl.makeVarargsCollector(MethodHandleImpl.java:445)
	at java.lang.invoke.MethodHandle.setVarargs(MethodHandle.java:1325)
	at java.lang.invoke.MethodHandles$Lookup.getDirectMethodCommon(MethodHandles.java:1665)
	at java.lang.invoke.MethodHandles$Lookup.getDirectMethod(MethodHandles.java:1600)
	at java.lang.invoke.MethodHandles$Lookup.findStatic(MethodHandles.java:779)
	at java.lang.invoke.MethodHandleImpl$Lazy.<clinit>(MethodHandleImpl.java:627)
	at java.lang.invoke.MethodHandleImpl.varargsArray(MethodHandleImpl.java:1506)
	at java.lang.invoke.MethodHandleImpl.varargsArray(MethodHandleImpl.java:1623)
	at java.lang.invoke.MethodHandle.asCollector(MethodHandle.java:999)
	at java.lang.invoke.MethodHandleImpl$AsVarargsCollector.<init>(MethodHandleImpl.java:460)
	at java.lang.invoke.MethodHandleImpl$AsVarargsCollector.<init>(MethodHandleImpl.java:454)
	at java.lang.invoke.MethodHandleImpl.makeVarargsCollector(MethodHandleImpl.java:445)
	at java.lang.invoke.MethodHandle.setVarargs(MethodHandle.java:1325)
	at java.lang.invoke.MethodHandles$Lookup.getDirectMethodCommon(MethodHandles.java:1665)
	at java.lang.invoke.MethodHandles$Lookup.getDirectMethod(MethodHandles.java:1600)
	at java.lang.invoke.MethodHandles$Lookup.findStatic(MethodHandles.java:779)
	at java.lang.invoke.CallSite.<clinit>(CallSite.java:226)
	at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(MethodHandleNatives.java:307)
	at java.lang.invoke.MethodHandleNatives.linkCallSite(MethodHandleNatives.java:297)
	at java.io.ObjectInputStream.<clinit>(ObjectInputStream.java:3578)
	at org.openjdk.jmh.runner.link.BinaryLinkClient.<init>(BinaryLinkClient.java:74)
	at org.openjdk.jmh.runner.ForkedMain.main(ForkedMain.java:72)
HPROF ERROR: Unexpected Exception found afterward [hprof_util.c:494]
HPROF TERMINATED PROCESS

Member:

I always run JMH with profilers from the command line (and have never heard of anybody doing anything differently). Running benchmarks from an IDE is not a good idea anyway (might it have affected the results you obtained previously?)

-prof perfasm is not a JVM arg, it's an arg of the JMH Main. Try java -jar target/benchmarks -prof list first.

If you are on Mac, you should use dtraceasm. You should also disable "System Integrity Protection" in order to be able to do this: http://osxdaily.com/2015/10/05/disable-rootless-system-integrity-protection-mac-os-x/

@gianm (author):

I use a JMH IDEA plugin: https://github.com/artyushov/idea-jmh-plugin. I had never read the readme until now, though, and apparently it can affect the results slightly (they say they observed up to a couple of percent).

What do you do to run the Druid JMH benchmarks? Do you know offhand what needs to be done to build them before they can be run? It'd be nice to add something to the Druid developer docs (https://github.com/druid-io/druid/blob/master/CONTRIBUTING.md) teaching people like me what to do.

      this,
      predicate
    );
    final BitSet checkedIds = new BitSet(getCardinality());
Member:

Same

    } else {
      matches = predicate.apply(selector.lookupName(id));
      checkedIds.set(id);
      matchingIds.set(id, matches);
Member:

if (matches) { matchingIds.set(id); } is better

@gianm (author):

This transformation gives worse performance in the benchmark:

FilterPartitionBenchmark.readWithExFnPostFilter            750000     basic  avgt   25  10587.515 ± 789.090  us/op

I suspect it is due to adding a branch where none previously existed.

Member:

It's impossible, because if you look at the source code of set(index, boolean), it has the same branching, plus it does extra work in the case when matches is false, which we can avoid. Something is wrong with your measurements.

@gianm (author):

Well, you're right. Even though I tried it twice yesterday (since I thought it surprising) and got this result, this morning I tried again and got something that was in line with the original result (around 8800 µs/op). So I'll make this minor transformation, because why not. The other, bigger transformation (collapsing to one bitset) still measures as slower though.
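For reference, the two forms under discussion are behaviorally identical; a quick standalone check (hypothetical class and method names) confirms the resulting bitsets are equal either way, the conditional form simply avoiding a clear() when the bit is already unset:

```java
import java.util.BitSet;

// Hypothetical demo: set(id, matches) and a conditional set(id) leave a
// freshly cleared bitset in the same state for any value of "matches".
class BitSetSetForms
{
  public static boolean sameState(final int id, final boolean matches)
  {
    final BitSet a = new BitSet(64);
    final BitSet b = new BitSet(64);
    a.set(id, matches);   // form used originally in the patch
    if (matches) {        // form suggested in review
      b.set(id);
    }
    return a.equals(b);
  }
}
```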

@gianm (author):

Pushed the minor transformation.

    } else {
      final boolean matches = predicate.apply(lookupName(id));
      checkedIds.set(id);
      matchingIds.set(id, matches);
Member:

Same

@drcrallen (Contributor):

This is on the order of a 5% slowdown for the bad case. Can you expand on when the good vs. bad cases would be encountered?


gianm commented Feb 21, 2018

> This is on the order of a 5% slowdown for the bad case. Can you expand on when the good vs. bad cases would be encountered?

Sure. ValueMatchers are for so-called "post filtering" which means filtering during a row scan, rather than using the index.

Imagine a filter like WHERE col1 = 'foo' AND col2 LIKE '%bar%' where col1 and col2 are both unindexed (so we need to do ValueMatcher filtering for both). The good case is where within the set of rows where col1 is 'foo', you don't see the entire constellation of possible col2 values. It means lazy matching saves a lot of time. The good case should be very common for multi-tenant type datasets (imagine WHERE customer_id = '123' AND product_id = '456' if products are unique to customers).

The bad case is where this does not happen: where there is no pruning of col2 values based on a col1 filter. In this case, laziness doesn't save any time.

Some more context on this change is: today Druid uses indexes very aggressively, so string post filtering doesn't happen very often. I think it actually may never happen on QueryableIndex, since between QueryableIndexStorageAdapter (which uses bitmaps for any top-level filters that support them) and OrFilter/AndFilter (which have makeMatcher impls that use bitmaps when possible for subfilters) it looks like it may never get called. So today, I think this change really just affects IncrementalIndex filtering.

The motivation behind this change is to open up future optimizations like:

  • Combined with the feature in Support for disabling bitmap indexes. #5402, make it more feasible to load big unique string values (like free form text) without wasting space and also wasting time on eager filtering when those columns are filtered on (it is very likely they will end up in the "good" case)
  • I want to explore optimizations that involve not always blindly using indexes. Consider a filter like WHERE col1 = '123' AND col2 LIKE '%foo%' where col1 and col2 both have bitmap indexes. If the filter col1 = '123' is very selective then the query will likely be faster if we avoid using the index for col2 LIKE '%foo%' (saving the work needed to examine every value and union a bunch of bitmaps) and instead run it as a ValueMatcher based filter.
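The good-case/bad-case distinction can be made concrete with a toy count of predicate evaluations (a hypothetical standalone sketch, not Druid code): eager matching pays once per dictionary id regardless of which rows are scanned, while lazy matching pays once per distinct id actually reached by the scan.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

// Hypothetical comparison of predicate-evaluation counts for eager vs.
// lazy matching over a dictionary-encoded column.
class LazyVsEagerCalls
{
  // Eager: one predicate call per id in the dictionary, up front.
  public static int eagerCalls(final int cardinality)
  {
    return cardinality;
  }

  // Lazy: one predicate call per distinct id seen during the scan.
  public static int lazyCalls(final int cardinality, final int[] rowIds, final IntPredicate predicate)
  {
    final BitSet checked = new BitSet(cardinality);
    int calls = 0;
    for (final int id : rowIds) {
      if (!checked.get(id)) {
        predicate.test(id);
        checked.set(id);
        calls++;
      }
    }
    return calls;
  }
}
```

With a cardinality of 100,000 and a prior filter selective enough that the scan only reaches two distinct ids, the lazy version makes two predicate calls instead of 100,000, which is the mechanism behind the large readAndFilter speedup above.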

@leventov (Member):

@gianm could we make the matcher eager then, if there is just one filter?


gianm commented Feb 21, 2018

> @gianm could we make the matcher eager then, if there is just one filter?

It would make sense, although it would be a more substantial refactoring of a lot of layers of interfaces (makeMatcher has to know which mode to run in). I would rather not do it in this PR. Especially since I don't think it will have an effect on QueryableIndex based queries right now. And even if we do explore using ValueMatchers more, it probably won't be too important then either, since we'd aim to only use them in the good case (if there's just one filter we'd use bitmaps, unless the user disabled bitmaps for that column).

@leventov (Member):

Might be worth creating an issue then.

@fjy (Contributor) commented Feb 27, 2018

👍


gianm commented Mar 1, 2018

> Might be worth creating an issue then.

Sure, I think that makes sense, and I will do it if this patch is merged as-is.

I have not had a chance to figure out how to do the verification you suggested in #5403 (comment) and I was hoping you have some tips about how to do that.

> So today, I think this change really just affects IncrementalIndex filtering.

I thought about this a little more and I realized there is a case where this change will affect QueryableIndex based queries. Filtered aggregators call makeMatcher unconditionally, so they will always use this matcher code. I believe it's still worth it though, since in the benchmarks we see a 5% slowdown in the bad case and a 50x speedup in an ideal case.


leventov commented Mar 1, 2018

To me, the patch could be merged (if some issues are created), @drcrallen what do you think?


gianm commented Jun 5, 2018

I've raised #5844 and merged master back into this branch. It looks like @leventov is ok with the patch based on his last comment, so I'll commit this once CI passes (unless anyone speaks up!)

@gianm gianm merged commit 78fd27c into apache:master Jun 5, 2018
@gianm gianm deleted the feature-lazy-valueMatching branch June 5, 2018 16:31
gianm added a commit to implydata/druid-public that referenced this pull request Jun 5, 2018
…che#5403)

* Lazy-ify ValueMatcher BitSet optimization for string dimensions.

The idea is that if the prior evaluated filters are decently selective,
such that they mean we won't see all possible values of the later
filters, then the eager version of the optimization is too wasteful.

This involves checking an extra bitset, but the overhead is small even
if the lazy-ification is useless.

* Remove import.

* Minor transformation
gianm added a commit to gianm/druid that referenced this pull request Jun 6, 2018
Follow-up to apache#5403, which only lazy-ified cursor-based filtering
on QueryableIndex.
jon-wei pushed a commit that referenced this pull request Jun 7, 2018
* Lazy-ify IncrementalIndex filtering too.

Follow-up to #5403, which only lazy-ified cursor-based filtering
on QueryableIndex.

* Fix logic error.
gianm added a commit to implydata/druid-public that referenced this pull request Jun 7, 2018
…che#5403)

* Lazy-ify ValueMatcher BitSet optimization for string dimensions.

The idea is that if the prior evaluated filters are decently selective,
such that they mean we won't see all possible values of the later
filters, then the eager version of the optimization is too wasteful.

This involves checking an extra bitset, but the overhead is small even
if the lazy-ification is useless.

* Remove import.

* Minor transformation
gianm added a commit to implydata/druid-public that referenced this pull request Jun 7, 2018
* Lazy-ify IncrementalIndex filtering too.

Follow-up to apache#5403, which only lazy-ified cursor-based filtering
on QueryableIndex.

* Fix logic error.
@dclim dclim added this to the 0.13.0 milestone Oct 8, 2018