Filters: Use ColumnSelectorFactory directly for building row-based matchers. by gianm · Pull Request #3797 · apache/druid

gianm · 2016-12-22T21:24:11Z

This is part 1 of the proposal in https://groups.google.com/d/msg/druid-development/_Sd78s7yU5U/RJ8qLOJ5DwAJ. There is a net reduction in code size, I think that's a good sign!

Main changes:

Remove ValueMatcherFactory completely.
Change Filter.makeMatcher(ValueMatcherFactory) to Filter.makeMatcher(ColumnSelectorFactory).
Add some helper methods to Filters.java: Filters.makeValueMatcher(ColumnSelectorFactory, String columnName, DruidPredicateFactory) and Filters.makeValueMatcher(ColumnSelectorFactory, String columnName, String value).
Did not add DimensionSelector.makeValueMatcher(value) or DimensionSelector.makeValueMatcher(predicateFactory) because benchmarks looked fine without them.

Also contained:

a bug fix in the BoundFilter; it should have treated null numbers as before all other numbers numerically, but it didn't. This was needed to help one of the tests pass.
an optimization in makeValueMatcher for columns consisting entirely of nulls.

Benchmarks below. I chose FilteredAggregatorBenchmark because filtered aggregators use Filter.makeMatcher on both incremental and queryable indexes.

master

Benchmark                                                (rowsPerSegment)  (schema)  Mode  Cnt      Score     Error  Units
FilteredAggregatorBenchmark.querySingleIncrementalIndex             75000     basic  avgt   25  19335.333 ± 466.020  us/op
FilteredAggregatorBenchmark.querySingleQueryableIndex               75000     basic  avgt   25   5398.617 ± 775.755  us/op

patch

FilteredAggregatorBenchmark.querySingleIncrementalIndex             75000     basic  avgt   25  18009.880 ± 403.749  us/op
FilteredAggregatorBenchmark.querySingleQueryableIndex               75000     basic  avgt   25   5433.927 ± 504.427  us/op

…tchers.

leventov · 2016-12-22T23:09:06Z

            Long upperLong = GuavaUtils.tryParseLong(upper);
            if (upperLong == null) {
+              // Unparseable values fall before all actual numbers, so no numbers can match the upper bound.
              matchesAnything = false;


It feels that "matchesNothing" would be a less confusing name for this field

leventov · 2016-12-22T23:12:17Z

-            hasLowerLongBoundVolatile = true;
-            lowerLongBoundVolatile = lowerLong;
          } else {
            hasLowerLongBoundVolatile = false;


Volatiles are not needed on all fields, except longsInitialized, thanks that it is assigned after assignments of all other fields on line 412: https://github.com/druid-io/druid/pull/3797/files#diff-0e3cdf1515450ea46ede7c61b5829291R412, and checked first on line 376: https://github.com/druid-io/druid/pull/3797/files#diff-0e3cdf1515450ea46ede7c61b5829291R376, that makes a good happens-before edge.

Sounds good, I'll change that and leave a comment. It's not strictly related to this PR but why not.

@leventov what you said is true, but I don't see harm in marking them all volatile and keeping things simple here to understand. is there a catch?

@himanshug I think making things "extra" safe by adding unnecessary volatiles adds confusion because when people see this code, they try to understand "why volatile is present here? It seems unnecessary. Am I missing something?" And waste time double-checking themselves

@leventov may be its the difference in perspective. I find it simpler when volatile is marked explicitly when needed unless it is very clear from the context. current code has to have this comment....

// longsInitialized is volatile since it establishes the happens-before relationship on // writes/reads to the rest of the fields (it's written last and read first).

that is trying to explain why only one of them is marked volatile and not others but all are expected to behave like volatile.

in any case, this is a small difference. I am ok with current code if you and @gianm both think it is simpler.

@himanshug I disagree with "why only one of them is marked volatile and not others but all are expected to behave like volatile". This is similar to the discussion below: #3797 (comment). We don't try to make all other fields "to behave like volatile". Volatiles on all other fields in this anonymous Supplier implementation not only unnecessary, but also don't really provide any extra guarantees. E. g. leaving volatile on all fields, but removing it on longsInitialized would break this code. volatile on longsInitialized is necessary and sufficient.

@leventov I agree with you technically and the HB set by the write and read of longsInitialized . Now this is a small anonymous class where it is easy to see whole class and reason about HB relationship, that is why I said it makes small difference here. But , note that @gianm had to put a comment there to explain the HB relationship.

Now, think about a larger class and people relying too much on the HB set by one volatile variable to make sure other variable reads will happen after their writes... all is well. but , later someone modifies the code in a way that breaks the original HB assumption without remembering to now update other variables to be volatile and suddenly we would have a bug.

@himanshug having or not having volatile on all other fields doesn't make any difference to "safety" or guarantees provided, so if @gianm chose to add a comment, it probably should have been here before as well.

Also having volatile on all other fields doesn't protect you from introducing subtle bugs as described, E. g. reordering reads or writes would break this code despite all fields are volatile.

@gianm @leventov @himanshug

Weighing in with my own $0.02, I tend to take the approach where if you are having to go through significant mental acrobatics to wrap your head around the happens-before relationship, then it's a good time to step back and look at exactly what you are trying to do.

In this case, I actually don't see a reason why you can't keep the "longs have been initialized" variable as volatile, and then after the first check, synchronize and then build the actual DruidLongPredicate inside the sychronization and cache that. Then subsequent calls to get() just return that object.

From my read of that code, once the init happens once, it doesn't look like the state changes and if you pull all of that stuff up into the main method, then they are all method-local variables and it is really easy to understand their interplay and lifecycle.

@cheddar indeed this seems to be a better approach here. If the predicate object is cached in a field of Supplier, it could be a sole field of this class (predicate != null instead of init flag) and it shouldn't even be volatile, because the predicate is either stateless or has final fields.

leventov · 2016-12-22T23:14:46Z

+   *
+   * @return row types
+   */
+  public static Map<String, ValueType> rowTypeFor(final GroupByQuery query)


Maybe use some other word instead of "type", i. e. "form" or "signature"? Because "type" is already "type of the column" in the same context. Or use "columnTypes" instead of "rowType"?

rowSignature works for me.

leventov · 2016-12-22T23:16:59Z

+  private final Supplier<? extends Row> row;
+  private final Map<String, ValueType> columnTypes;
+
+  public RowBasedColumnSelectorFactory(final Supplier<? extends Row> row, final Map<String, ValueType> columnTypes)


Please add @Nullable to columnTypes

leventov · 2016-12-22T23:17:30Z

+
+  public static RowBasedColumnSelectorFactory create(
+      final Supplier<? extends Row> row,
+      final Map<String, ValueType> columnTypes


Please add @Nullable to columnTypes

leventov · 2016-12-23T06:28:10Z

    if (havingSpec != null) {
+      havingSpec.setRowType(GroupByQueryHelper.rowTypeFor(this));
      postProcFn = Functions.compose(
          postProcFn,


Maybe just apply this postProcFn inside the body of the Function declared the line below, instead of Function.compose()? When one or both functions are created right here Function.compose() seems to be pointless complication.

I prefer to leave it as-is.

leventov · 2016-12-23T06:29:32Z

-    }
-    finally {
-      valueMatcherFactory.setRow(null);
+    synchronized (valueMatcherMonitor) {


Even if you don't change the logic of GroupByQuery, this synchronization seems to be pointless to me because I don't see who is going to call this method concurrently.

The reason I had for making DimFilterHavingSpec thread-safe is that it's a single object shared by everything processing the groupBy query, and it seemed like making it non-thread-safe was asking for trouble. Some kinds of potential trouble are,

IntervalChunkingQueryRunner splits up result merging into multiple threads if chunkPeriod is set. Currently it doesn't split up the havingSpec processing, since the work is done on accumulate/toYielder rather than on construction of the Sequence, but it seems a flimsy guarantee that things will remain like this forever.

We may want to implement multithreaded merging of groupBy results in the future, independent of IntervalChunkingQueryRunner, and this may involve HavingSpecs being shared between threads.

But I guess this isn't really necessary, since I do believe that right now it's not shared between threads. So I can make it not thread safe and add a comment about that to HavingSpec.

I'm also adding a detector for concurrent usage just in case, so this will fail fast if anyone breaks the contract in the future.

leventov · 2016-12-23T06:41:10Z

+  private final ThreadLocal<Row> rowSupplier;
+
+  private final Object valueMatcherMonitor = new Object();
+  private volatile ValueMatcher valueMatcher;


volatile on this fields doesn't solve any problems. If groupBy work is split between threads and the parts of work are published without happens-before between the thread where setRowType() is called (i. e. the thread where GroupByQuery is constructed, in the current version of this PR), having volatile on this field still doesn't guarantee (theoretically) that non-null value will be read in eval().

However, if there is HB, it guarantees visibility of valueMatcher write in eval(), even if valueMatcher is not volatile.

And it's actually hard to transmit work between threads without HB unless you use something like standard, not synchronized Queue to transmit tasks. ExecutorService provides HB:

Memory consistency effects: Actions in a thread prior to the submission of a Runnable or Callable task to an ExecutorService happen-before any actions taken by that task, which in turn happen-before the result is retrieved via Future.get().

(from ExecutorService class-level Javadoc)

Are you saying that you'd prefer to remove the volatile here and rely on happens-before relationships established by passing the havingSpec to worker threads using executor services?

@gianm not only that, I pointed out that making valueMatcher volatile doesn't really provide you any extra guarantees. To guarantee visibility of some value in some field, you should add synchronized, volatile, or some other HB primitives on something else, not on the field itself.

Are you talking about the risk that a ValueMatcher impl will not be safely initialized and that its internal state will not be properly visible to another thread? In that case, yeah, that's a good point. I don't think any current ValueMatchers would have this behavior but I suppose they might in theory.

Or are you saying that it's possible for an object that was safely initialized, and written to a volatile reference, to be then read from another thread as a null reference? In my understanding of the JMM, that is not possible, since writing a safely initialized object to a volatile reference will cause the reference and also the object's initialized state to be visible to any thread that reads the reference.

Either way, I agree we can get rid of thread safetiness in DimFilterHavingSpec so this point is moot as far as the PR goes.

Yes, volatile on valueMatcher helps with visibility of valueMatcher's contents initialized, but it doesn't (theoretically) save you from NPE. Nothing in JMM stops VM from caching/hoisting volatile field from a very long running loop, and it won't see any volatile writes to this field, because it won't re-read it.

Though loop example is misleading, because even without loops, a thread is not required to read the "most recent" value from volatile, because there is no global order of volatile reads/writes in the program, so the "most recent" is not defined. Only if somebody write value x to a volatile field in a thread A and somebody else read that x from that field in a thread B, it forms a happens-before edge and allows to build partial orders. But thread B is not required to ever read that x.

@leventov this is to better understand #3797 (comment) , say, we have

private volatile ValueMatcher valueMatcher;

now, at time t1, Thread A does a write on that...

valueMather = new ValueMatcher() { ...}

and, at time t2 greater than t1, Thread B reads same...

if (valueMatcher == null) { System.out.println("I am null"); }

are you saying it is possible that "I am null" might get printed at t2 and afterwards by Thread B?

that sounds against the "volatile variable rule".... "A write to a volatile field happens-before every subsequent read of that field."

@himanshug yes, this is possible. "Afterwards" is not defined. In "A write to a volatile field happens-before every subsequent read of that field." (if this is a direct citation of JMM) some parts omitted, it should be "A write to a volatile field of value x happens-before every subsequent read of value x from that field.", + transitivity. But there is no HB if pre-x value is read from a field, null in your example.

leventov · 2016-12-23T06:42:04Z

        querySpecificConfig.getMaxMergingDictionarySize() / (concurrencyHint == -1 ? 1 : concurrencyHint)
    );
-    final RowBasedColumnSelectorFactory columnSelectorFactory = new RowBasedColumnSelectorFactory();
+    final ThreadLocal<Row> columnSelectorRow = new ThreadLocal<>();


Same consideration about reducing number of ThreadLocal accesses as above

Column selectors made here may be used by multiple different processing threads, so the ThreadLocal is useful.

leventov · 2016-12-23T07:08:28Z

-      return new BooleanValueMatcher(false);
-    }
-
-    return new ValueMatcher()


My concern is that the new realtime dimension value matcher (StringValueMatcherColumnSelectorStrategy, line 51) is less specialised than this one. The difference between benchmarks and production is that benchmarks run the same code (same impls of some interfaces, called in hot loops) again and again, so the calls are monomorphic and inlined, but in production they stay virtual, because different queries cause different impls of some interfaces appear in hot loops. In this particular case, the version of ValueMatcher from StringValueMatcherColumnSelectorStrategy may specialize getRow() because it is monomorphic, then eliminate IndexedInts allocation because it can prove that it doesn't escape, and then essentially run the same code as here, in StringDimensionIndexer.makeIndexingValueMatcher(). But in production getRow() in StringValueMatcherColumnSelectorStrategy won't be inlined, so IndexedInts indirection wrapping int[] array will remain.

So I suggest to add DimensionSelector.makeValueMatcher() anyway, maybe as a separate PR, but necessarily in the same Druid release as this PR.

I would really like to be able to see this happen in a benchmark, since otherwise it's challenging to prevent performance regressions. A future patch should be adding more kinds of ValueMatcherColumnSelectorStrategies. Do you think it'd work to have two different types of value matchers used in the same benchmark? We could do that in FilteredAggregatorBenchmark.

I'll file an issue for 0.10.0 to look into this.

Raised #3799.

Btw, you can cause situations like the one that leventov is describing in benchmarks by running the multiple code paths. Once Java knows that it can be called multiple ways, it will undo whatever specializations it has done that no longer apply. So, in the benchmark, if you actually benchmark things together in the same JVM you can see it.

Fwiw, when you find one of these, it can be fun to watch this happen and watch things speed up and slow down by adjusting the order of what runs first. The first one run is sometimes always fastest, no matter which algorithm it is.

@cheddar are you saying it would work to have two different types of value matchers in the same benchmark? We only have the one now (String) but we can definitely mix them together in the same benchmark when we add more.

Yes. You will want to do this before starting the timer on the benchmark and also make sure you run each of the options with enough iterations to cause optimizations to actually occur as well as the backing out of those optimizations. This is generally black magic unless you are running a debug build of the jvm where you can turn on logging of optimizations (something I've read about doing, but never done myself).

- BoundDimFilter: fewer volatiles, rename matchesAnything to !matchesNothing. - HavingSpecs: Clarify that they are not thread-safe, and make DimFilterHavingSpec not thread safe. - Renamed rowType to rowSignature. - Added specializations for time-based vs non-time-based DimensionSelector in RBCSF. - Added convenience method DimensionHanderUtils.createColumnSelectorPlus. - Added singleton ZeroIndexedInts. - Added test cases for DimFilterHavingSpec.

…elector.

gianm · 2016-12-23T19:27:19Z

@leventov thanks for the review. I just pushed a round of changes, please let me know what you think.

leventov · 2016-12-23T19:28:25Z

+          final List<String> dimensionValues = row.get().getDimension(dimension);
+          final int dimensionValuesSize = dimensionValues != null ? dimensionValues.size() : 0;
+
+          return new IndexedInts()


Similarly to ZeroIndexedInts, RangeIndexedInts could be extracted and certain amount of them cached globally.

leventov · 2016-12-23T19:34:41Z

            @Override
            public Sequence<Row> apply(Sequence<Row> input)
            {
+              GroupByQuery.this.havingSpec.setRowSignature(GroupByQueryHelper.rowSignatureFor(GroupByQuery.this));


Maybe makeHavingSpec immutable, by extracting the mutable part of its state into some Predicate<Row> impl returned from some method like HavingSpec.makePredicate(rowSignature)?

I think this is a good idea although not needed for this PR. It would be a good change if we ever do need to use HavingSpecs in multiple different threads.

leventov · 2016-12-23T19:36:33Z

+  @Override
+  public int get(int index)
+  {
+    return 0;


Have you intentionally omitted index check?

Yes, I don't think the check is adding much value. Reading past the end of a row is not a common bug. I can add the check back though if you think it's worth it.

@gianm I think it's ok, but it deserves a comment.

Hmm, I can also add an assert so there's a check during tests but not in production. assuming asserts are disabled in production, which I haven't verified.

gianm · 2016-12-23T21:11:32Z

@leventov: pushed new commits adding RangeIndexedInts, the ZeroIndexedInts comment, some tweaks to the concurrent usage guard on DimFilterHavingSpec, and some minor refactorings.

leventov

Do you prefer to not take this approach: #3797 (comment)?

leventov · 2016-12-23T21:53:14Z

+    this.size = size;
+  }
+
+  public static RangeIndexedInts create(final int size)


Maybe call this method "acquire()" or "getOrCreate()", because it doesn't necessarily create something?

I thought the little white lie here would be okay.

gianm · 2016-12-23T21:59:16Z

@leventov I like your suggestion but think it is not necessary for this PR.

leventov

Ok, LGTM

himanshug · 2016-12-28T18:29:58Z

👍

leventov · 2016-12-29T10:11:24Z

+
+      // Other fields are not volatile.
+      private boolean matchesNothing;
+      private boolean hasLowerLongBoundVolatile;


BTW now names of these fields shouldn't have "Volatile" suffix.

leventov

LGTM

jon-wei · 2017-01-03T22:05:10Z

👍, I think @cheddar's suggestion re: volatile/predicate handling in the BoundDimFilter is a nice approach if you feel like updating that in this PR

gianm · 2017-01-03T22:22:46Z

@jon-wei thanks for the review. I'd rather do that change as a separate one (if at all) since it's only tangentially related to this patch.

…tchers. (apache#3797) * Filters: Use ColumnSelectorFactory directly for building row-based matchers. * Adjustments based on code review. - BoundDimFilter: fewer volatiles, rename matchesAnything to !matchesNothing. - HavingSpecs: Clarify that they are not thread-safe, and make DimFilterHavingSpec not thread safe. - Renamed rowType to rowSignature. - Added specializations for time-based vs non-time-based DimensionSelector in RBCSF. - Added convenience method DimensionHanderUtils.createColumnSelectorPlus. - Added singleton ZeroIndexedInts. - Added test cases for DimFilterHavingSpec. * Make ValueMatcherColumnSelectorStrategy actually use the associated selector. * Add RangeIndexedInts. * DimFilterHavingSpec: Fix concurrent usage guard on jdk7. * Add assertion to ZeroIndexedInts. * Rename no-longer-volatile members.

Filters: Use ColumnSelectorFactory directly for building row-based ma…

3c6127d

…tchers.

gianm added the Improvement label Dec 22, 2016

gianm added this to the 0.10.0 milestone Dec 22, 2016

gianm assigned jon-wei Dec 22, 2016

gianm closed this Dec 22, 2016

gianm reopened this Dec 22, 2016

gianm closed this Dec 22, 2016

gianm reopened this Dec 22, 2016

leventov requested changes Dec 23, 2016

View reviewed changes

gianm mentioned this pull request Dec 23, 2016

Investigate IncrementalIndex ValueMatcher performance #3799

Closed

Make ValueMatcherColumnSelectorStrategy actually use the associated s…

8407d29

…elector.

leventov reviewed Dec 23, 2016

View reviewed changes

gianm added 3 commits December 23, 2016 15:07

Add RangeIndexedInts.

efe9f7c

DimFilterHavingSpec: Fix concurrent usage guard on jdk7.

b84b7a9

Add assertion to ZeroIndexedInts.

3c6401c

leventov reviewed Dec 23, 2016

View reviewed changes

leventov requested changes Dec 29, 2016

View reviewed changes

Rename no-longer-volatile members.

5c64dbe

leventov reviewed Dec 30, 2016

View reviewed changes

gianm merged commit d8702eb into apache:master Jan 3, 2017

gianm deleted the filter-csf branch January 3, 2017 22:30

gianm mentioned this pull request Jan 3, 2017

Ignore DimFilterHavingSpec testConcurrentUsage. #3814

Merged

jon-wei mentioned this pull request Jan 11, 2017

Add long/float ColumnSelectorStrategy implementations #3838

Merged

gianm mentioned this pull request Jan 12, 2017

Comparing dimensions to each other in a filter #3840

Closed

This was referenced Jan 17, 2017

Redundant volatiles in BaseTopNAlgorithm.BaseArrayProvider #3857

Closed

DimensionSelector-related bug fixes and optimizations (fixes #3799, part of #3798) #3858

Merged

clambertus unassigned jon-wei Jul 6, 2018

Conversation

gianm commented Dec 22, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

leventov Dec 29, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

leventov Dec 23, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

himanshug Dec 29, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gianm Dec 23, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

leventov Dec 23, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gianm commented Dec 22, 2016 •

edited

Loading

leventov Dec 29, 2016 •

edited

Loading

leventov Dec 23, 2016 •

edited

Loading

himanshug Dec 29, 2016 •

edited

Loading

gianm Dec 23, 2016 •

edited

Loading

leventov Dec 23, 2016 •

edited

Loading