Filters: Use ColumnSelectorFactory directly for building row-based matchers.#3797
Filters: Use ColumnSelectorFactory directly for building row-based matchers.#3797gianm merged 7 commits intoapache:masterfrom
Conversation
| Long upperLong = GuavaUtils.tryParseLong(upper); | ||
| if (upperLong == null) { | ||
| // Unparseable values fall before all actual numbers, so no numbers can match the upper bound. | ||
| matchesAnything = false; |
There was a problem hiding this comment.
It feels that "matchesNothing" would be a less confusing name for this field
| hasLowerLongBoundVolatile = true; | ||
| lowerLongBoundVolatile = lowerLong; | ||
| } else { | ||
| hasLowerLongBoundVolatile = false; |
There was a problem hiding this comment.
Volatiles are not needed on all fields, except longsInitialized, thanks that it is assigned after assignments of all other fields on line 412: https://github.com/druid-io/druid/pull/3797/files#diff-0e3cdf1515450ea46ede7c61b5829291R412, and checked first on line 376: https://github.com/druid-io/druid/pull/3797/files#diff-0e3cdf1515450ea46ede7c61b5829291R376, that makes a good happens-before edge.
There was a problem hiding this comment.
Sounds good, I'll change that and leave a comment. It's not strictly related to this PR but why not.
There was a problem hiding this comment.
@leventov what you said is true, but I don't see harm in marking them all volatile and keeping things simple here to understand. is there a catch?
There was a problem hiding this comment.
@himanshug I think making things "extra" safe by adding unnecessary volatiles adds confusion because when people see this code, they try to understand "why volatile is present here? It seems unnecessary. Am I missing something?" And waste time double-checking themselves
There was a problem hiding this comment.
@leventov may be its the difference in perspective. I find it simpler when volatile is marked explicitly when needed unless it is very clear from the context. current code has to have this comment....
// longsInitialized is volatile since it establishes the happens-before relationship on
// writes/reads to the rest of the fields (it's written last and read first).
that is trying to explain why only one of them is marked volatile and not others but all are expected to behave like volatile.
in any case, this is a small difference. I am ok with current code if you and @gianm both think it is simpler.
There was a problem hiding this comment.
@himanshug I disagree with "why only one of them is marked volatile and not others but all are expected to behave like volatile". This is similar to the discussion below: #3797 (comment). We don't try to make all other fields "to behave like volatile". Volatiles on all other fields in this anonymous Supplier implementation not only unnecessary, but also don't really provide any extra guarantees. E. g. leaving volatile on all fields, but removing it on longsInitialized would break this code. volatile on longsInitialized is necessary and sufficient.
There was a problem hiding this comment.
@leventov I agree with you technically and the HB set by the write and read of longsInitialized . Now this is a small anonymous class where it is easy to see whole class and reason about HB relationship, that is why I said it makes small difference here. But , note that @gianm had to put a comment there to explain the HB relationship.
Now, think about a larger class and people relying too much on the HB set by one volatile variable to make sure other variable reads will happen after their writes... all is well. but , later someone modifies the code in a way that breaks the original HB assumption without remembering to now update other variables to be volatile and suddenly we would have a bug.
There was a problem hiding this comment.
@himanshug having or not having volatile on all other fields doesn't make any difference to "safety" or guarantees provided, so if @gianm chose to add a comment, it probably should have been here before as well.
Also having volatile on all other fields doesn't protect you from introducing subtle bugs as described, E. g. reordering reads or writes would break this code despite all fields are volatile.
There was a problem hiding this comment.
Weighing in with my own $0.02, I tend to take the approach where if you are having to go through significant mental acrobatics to wrap your head around the happens-before relationship, then it's a good time to step back and look at exactly what you are trying to do.
In this case, I actually don't see a reason why you can't keep the "longs have been initialized" variable as volatile, and then after the first check, synchronize and then build the actual DruidLongPredicate inside the sychronization and cache that. Then subsequent calls to get() just return that object.
From my read of that code, once the init happens once, it doesn't look like the state changes and if you pull all of that stuff up into the main method, then they are all method-local variables and it is really easy to understand their interplay and lifecycle.
There was a problem hiding this comment.
@cheddar indeed this seems to be a better approach here. If the predicate object is cached in a field of Supplier, it could be a sole field of this class (predicate != null instead of init flag) and it shouldn't even be volatile, because the predicate is either stateless or has final fields.
| * | ||
| * @return row types | ||
| */ | ||
| public static Map<String, ValueType> rowTypeFor(final GroupByQuery query) |
There was a problem hiding this comment.
Maybe use some other word instead of "type", i. e. "form" or "signature"? Because "type" is already "type of the column" in the same context. Or use "columnTypes" instead of "rowType"?
There was a problem hiding this comment.
rowSignature works for me.
| private final Supplier<? extends Row> row; | ||
| private final Map<String, ValueType> columnTypes; | ||
|
|
||
| public RowBasedColumnSelectorFactory(final Supplier<? extends Row> row, final Map<String, ValueType> columnTypes) |
There was a problem hiding this comment.
Please add @Nullable to columnTypes
|
|
||
| public static RowBasedColumnSelectorFactory create( | ||
| final Supplier<? extends Row> row, | ||
| final Map<String, ValueType> columnTypes |
There was a problem hiding this comment.
Please add @Nullable to columnTypes
| if (havingSpec != null) { | ||
| havingSpec.setRowType(GroupByQueryHelper.rowTypeFor(this)); | ||
| postProcFn = Functions.compose( | ||
| postProcFn, |
There was a problem hiding this comment.
Maybe just apply this postProcFn inside the body of the Function declared the line below, instead of Function.compose()? When one or both functions are created right here Function.compose() seems to be pointless complication.
There was a problem hiding this comment.
I prefer to leave it as-is.
| } | ||
| finally { | ||
| valueMatcherFactory.setRow(null); | ||
| synchronized (valueMatcherMonitor) { |
There was a problem hiding this comment.
Even if you don't change the logic of GroupByQuery, this synchronization seems to be pointless to me because I don't see who is going to call this method concurrently.
There was a problem hiding this comment.
The reason I had for making DimFilterHavingSpec thread-safe is that it's a single object shared by everything processing the groupBy query, and it seemed like making it non-thread-safe was asking for trouble. Some kinds of potential trouble are,
- IntervalChunkingQueryRunner splits up result merging into multiple threads if
chunkPeriodis set. Currently it doesn't split up the havingSpec processing, since the work is done on accumulate/toYielder rather than on construction of the Sequence, but it seems a flimsy guarantee that things will remain like this forever. - We may want to implement multithreaded merging of groupBy results in the future, independent of IntervalChunkingQueryRunner, and this may involve HavingSpecs being shared between threads.
But I guess this isn't really necessary, since I do believe that right now it's not shared between threads. So I can make it not thread safe and add a comment about that to HavingSpec.
There was a problem hiding this comment.
I'm also adding a detector for concurrent usage just in case, so this will fail fast if anyone breaks the contract in the future.
| private final ThreadLocal<Row> rowSupplier; | ||
|
|
||
| private final Object valueMatcherMonitor = new Object(); | ||
| private volatile ValueMatcher valueMatcher; |
There was a problem hiding this comment.
volatile on this fields doesn't solve any problems. If groupBy work is split between threads and the parts of work are published without happens-before between the thread where setRowType() is called (i. e. the thread where GroupByQuery is constructed, in the current version of this PR), having volatile on this field still doesn't guarantee (theoretically) that non-null value will be read in eval().
However, if there is HB, it guarantees visibility of valueMatcher write in eval(), even if valueMatcher is not volatile.
And it's actually hard to transmit work between threads without HB unless you use something like standard, not synchronized Queue to transmit tasks. ExecutorService provides HB:
Memory consistency effects: Actions in a thread prior to the submission of a Runnable or Callable task to an ExecutorService happen-before any actions taken by that task, which in turn happen-before the result is retrieved via Future.get().
(from ExecutorService class-level Javadoc)
There was a problem hiding this comment.
Are you saying that you'd prefer to remove the volatile here and rely on happens-before relationships established by passing the havingSpec to worker threads using executor services?
There was a problem hiding this comment.
@gianm not only that, I pointed out that making valueMatcher volatile doesn't really provide you any extra guarantees. To guarantee visibility of some value in some field, you should add synchronized, volatile, or some other HB primitives on something else, not on the field itself.
There was a problem hiding this comment.
Are you talking about the risk that a ValueMatcher impl will not be safely initialized and that its internal state will not be properly visible to another thread? In that case, yeah, that's a good point. I don't think any current ValueMatchers would have this behavior but I suppose they might in theory.
Or are you saying that it's possible for an object that was safely initialized, and written to a volatile reference, to be then read from another thread as a null reference? In my understanding of the JMM, that is not possible, since writing a safely initialized object to a volatile reference will cause the reference and also the object's initialized state to be visible to any thread that reads the reference.
Either way, I agree we can get rid of thread safetiness in DimFilterHavingSpec so this point is moot as far as the PR goes.
There was a problem hiding this comment.
Yes, volatile on valueMatcher helps with visibility of valueMatcher's contents initialized, but it doesn't (theoretically) save you from NPE. Nothing in JMM stops VM from caching/hoisting volatile field from a very long running loop, and it won't see any volatile writes to this field, because it won't re-read it.
There was a problem hiding this comment.
Though loop example is misleading, because even without loops, a thread is not required to read the "most recent" value from volatile, because there is no global order of volatile reads/writes in the program, so the "most recent" is not defined. Only if somebody write value x to a volatile field in a thread A and somebody else read that x from that field in a thread B, it forms a happens-before edge and allows to build partial orders. But thread B is not required to ever read that x.
There was a problem hiding this comment.
@leventov this is to better understand #3797 (comment) , say, we have
private volatile ValueMatcher valueMatcher;
now, at time t1, Thread A does a write on that...
valueMather = new ValueMatcher() { ...}
and, at time t2 greater than t1, Thread B reads same...
if (valueMatcher == null) {
System.out.println("I am null");
}
are you saying it is possible that "I am null" might get printed at t2 and afterwards by Thread B?
that sounds against the "volatile variable rule".... "A write to a volatile field happens-before every subsequent read of that field."
There was a problem hiding this comment.
@himanshug yes, this is possible. "Afterwards" is not defined. In "A write to a volatile field happens-before every subsequent read of that field." (if this is a direct citation of JMM) some parts omitted, it should be "A write to a volatile field of value x happens-before every subsequent read of value x from that field.", + transitivity. But there is no HB if pre-x value is read from a field, null in your example.
| querySpecificConfig.getMaxMergingDictionarySize() / (concurrencyHint == -1 ? 1 : concurrencyHint) | ||
| ); | ||
| final RowBasedColumnSelectorFactory columnSelectorFactory = new RowBasedColumnSelectorFactory(); | ||
| final ThreadLocal<Row> columnSelectorRow = new ThreadLocal<>(); |
There was a problem hiding this comment.
Same consideration about reducing number of ThreadLocal accesses as above
There was a problem hiding this comment.
Column selectors made here may be used by multiple different processing threads, so the ThreadLocal is useful.
| return new BooleanValueMatcher(false); | ||
| } | ||
|
|
||
| return new ValueMatcher() |
There was a problem hiding this comment.
My concern is that the new realtime dimension value matcher (StringValueMatcherColumnSelectorStrategy, line 51) is less specialised than this one. The difference between benchmarks and production is that benchmarks run the same code (same impls of some interfaces, called in hot loops) again and again, so the calls are monomorphic and inlined, but in production they stay virtual, because different queries cause different impls of some interfaces appear in hot loops. In this particular case, the version of ValueMatcher from StringValueMatcherColumnSelectorStrategy may specialize getRow() because it is monomorphic, then eliminate IndexedInts allocation because it can prove that it doesn't escape, and then essentially run the same code as here, in StringDimensionIndexer.makeIndexingValueMatcher(). But in production getRow() in StringValueMatcherColumnSelectorStrategy won't be inlined, so IndexedInts indirection wrapping int[] array will remain.
So I suggest to add DimensionSelector.makeValueMatcher() anyway, maybe as a separate PR, but necessarily in the same Druid release as this PR.
There was a problem hiding this comment.
I would really like to be able to see this happen in a benchmark, since otherwise it's challenging to prevent performance regressions. A future patch should be adding more kinds of ValueMatcherColumnSelectorStrategies. Do you think it'd work to have two different types of value matchers used in the same benchmark? We could do that in FilteredAggregatorBenchmark.
I'll file an issue for 0.10.0 to look into this.
There was a problem hiding this comment.
Btw, you can cause situations like the one that leventov is describing in benchmarks by running the multiple code paths. Once Java knows that it can be called multiple ways, it will undo whatever specializations it has done that no longer apply. So, in the benchmark, if you actually benchmark things together in the same JVM you can see it.
Fwiw, when you find one of these, it can be fun to watch this happen and watch things speed up and slow down by adjusting the order of what runs first. The first one run is sometimes always fastest, no matter which algorithm it is.
There was a problem hiding this comment.
@cheddar are you saying it would work to have two different types of value matchers in the same benchmark? We only have the one now (String) but we can definitely mix them together in the same benchmark when we add more.
There was a problem hiding this comment.
Yes. You will want to do this before starting the timer on the benchmark and also make sure you run each of the options with enough iterations to cause optimizations to actually occur as well as the backing out of those optimizations. This is generally black magic unless you are running a debug build of the jvm where you can turn on logging of optimizations (something I've read about doing, but never done myself).
- BoundDimFilter: fewer volatiles, rename matchesAnything to !matchesNothing. - HavingSpecs: Clarify that they are not thread-safe, and make DimFilterHavingSpec not thread safe. - Renamed rowType to rowSignature. - Added specializations for time-based vs non-time-based DimensionSelector in RBCSF. - Added convenience method DimensionHanderUtils.createColumnSelectorPlus. - Added singleton ZeroIndexedInts. - Added test cases for DimFilterHavingSpec.
|
@leventov thanks for the review. I just pushed a round of changes, please let me know what you think. |
| final List<String> dimensionValues = row.get().getDimension(dimension); | ||
| final int dimensionValuesSize = dimensionValues != null ? dimensionValues.size() : 0; | ||
|
|
||
| return new IndexedInts() |
There was a problem hiding this comment.
Similarly to ZeroIndexedInts, RangeIndexedInts could be extracted and certain amount of them cached globally.
| @Override | ||
| public Sequence<Row> apply(Sequence<Row> input) | ||
| { | ||
| GroupByQuery.this.havingSpec.setRowSignature(GroupByQueryHelper.rowSignatureFor(GroupByQuery.this)); |
There was a problem hiding this comment.
Maybe makeHavingSpec immutable, by extracting the mutable part of its state into some Predicate<Row> impl returned from some method like HavingSpec.makePredicate(rowSignature)?
There was a problem hiding this comment.
I think this is a good idea although not needed for this PR. It would be a good change if we ever do need to use HavingSpecs in multiple different threads.
| @Override | ||
| public int get(int index) | ||
| { | ||
| return 0; |
There was a problem hiding this comment.
Have you intentionally omitted index check?
There was a problem hiding this comment.
Yes, I don't think the check is adding much value. Reading past the end of a row is not a common bug. I can add the check back though if you think it's worth it.
There was a problem hiding this comment.
Hmm, I can also add an assert so there's a check during tests but not in production. assuming asserts are disabled in production, which I haven't verified.
|
@leventov: pushed new commits adding RangeIndexedInts, the ZeroIndexedInts comment, some tweaks to the concurrent usage guard on DimFilterHavingSpec, and some minor refactorings. |
leventov
left a comment
There was a problem hiding this comment.
Do you prefer to not take this approach: #3797 (comment)?
| this.size = size; | ||
| } | ||
|
|
||
| public static RangeIndexedInts create(final int size) |
There was a problem hiding this comment.
Maybe call this method "acquire()" or "getOrCreate()", because it doesn't necessarily create something?
There was a problem hiding this comment.
I thought the little white lie here would be okay.
|
@leventov I like your suggestion but think it is not necessary for this PR. |
|
👍 |
|
|
||
| // Other fields are not volatile. | ||
| private boolean matchesNothing; | ||
| private boolean hasLowerLongBoundVolatile; |
There was a problem hiding this comment.
BTW now names of these fields shouldn't have "Volatile" suffix.
|
👍, I think @cheddar's suggestion re: volatile/predicate handling in the BoundDimFilter is a nice approach if you feel like updating that in this PR |
|
@jon-wei thanks for the review. I'd rather do that change as a separate one (if at all) since it's only tangentially related to this patch. |
…tchers. (apache#3797) * Filters: Use ColumnSelectorFactory directly for building row-based matchers. * Adjustments based on code review. - BoundDimFilter: fewer volatiles, rename matchesAnything to !matchesNothing. - HavingSpecs: Clarify that they are not thread-safe, and make DimFilterHavingSpec not thread safe. - Renamed rowType to rowSignature. - Added specializations for time-based vs non-time-based DimensionSelector in RBCSF. - Added convenience method DimensionHanderUtils.createColumnSelectorPlus. - Added singleton ZeroIndexedInts. - Added test cases for DimFilterHavingSpec. * Make ValueMatcherColumnSelectorStrategy actually use the associated selector. * Add RangeIndexedInts. * DimFilterHavingSpec: Fix concurrent usage guard on jdk7. * Add assertion to ZeroIndexedInts. * Rename no-longer-volatile members.
This is part 1 of the proposal in https://groups.google.com/d/msg/druid-development/_Sd78s7yU5U/RJ8qLOJ5DwAJ. There is a net reduction in code size, I think that's a good sign!
Main changes:
Filter.makeMatcher(ValueMatcherFactory)toFilter.makeMatcher(ColumnSelectorFactory).Filters.makeValueMatcher(ColumnSelectorFactory, String columnName, DruidPredicateFactory)andFilters.makeValueMatcher(ColumnSelectorFactory, String columnName, String value).Also contained:
Benchmarks below. I chose FilteredAggregatorBenchmark because filtered aggregators use
Filter.makeMatcheron both incremental and queryable indexes.