Fixes, adjustments to numeric null handling and string first/last aggregators. #8834
Merged
fjy merged 1 commit into apache:master on Nov 8, 2019
Contributor
Wow, that is a lot of changes. You basically rewrote them. 🚀
Contributor
@gianm I think this is failing UT
01fd1ab to d82c209
Contributor
Author
I had to adjust the design somewhat to keep things working at ingest time. I added a new test for this too, to ensure it works. The branch and original description are now updated.
eca9447 to 08f6426
…regators.

There is a class of bugs due to the fact that BaseObjectColumnValueSelector has both "getObject" and "isNull" methods, but in most selector implementations and most call sites, it is clear that the intent of "isNull" is only to apply to the primitive getters, not the object getter. This makes sense, because the purpose of isNull is to enable detection of nulls in otherwise-primitive columns. Imagine a string column with a numeric selector built on top of it. You would want it to return isNull = true, so numeric aggregators don't treat it as all zeroes.

Sometimes this design leads people to accidentally guard non-primitive get methods with "selector.isNull" checks, which is improper.

This patch has three goals:

1) Fix null-handling bugs that already exist in this class.
2) Make interface and doc changes that reduce the probability of future bugs.
3) Fix other, unrelated bugs I noticed in the stringFirst and stringLast aggregators while fixing null-handling bugs. I thought about splitting this into its own patch, but it ended up being tough to split from the null-handling fixes.

For (1) the fixes are,

- Fix StringFirst and StringLastAggregatorFactory to stop guarding getObject calls on isNull, by no longer extending NullableAggregatorFactory. Now uses -1 as a sigil value for null, to differentiate nulls and empty strings.
- Fix ExpressionFilter to stop guarding getObject calls on isNull. Also, use eval.asBoolean() to avoid calling getLong on the selector after already calling getObject.
- Fix ObjectBloomFilterAggregator to stop guarding DimensionSelector calls on isNull. Also, refactored slightly to avoid the overhead of calling getObject followed by another getter (see BloomFilterAggregatorFactory for part of this).

For (2) the main changes are,

- Remove the "isNull" method from BaseObjectColumnValueSelector.
- Clarify "isNull" doc on BaseNullableColumnValueSelector.
- Rename NullableAggregatorFactory -> NullableNumericAggregatorFactory to emphasize that it only works on aggregators that take numbers as input.
- Similar naming changes to the Aggregator, BufferAggregator, and AggregateCombiner.
- Similar naming changes to helper methods for groupBy, ValueMatchers, etc.

For (3) the other fixes for StringFirst and StringLastAggregatorFactory are,

- Fixed buffer overrun in the buffer aggregators when some characters in the string encode into more than one byte (the old code used "substring" to apply a byte limit, which is wrong because substring counts chars, not bytes). I did this by introducing a new StringUtils.toUtf8WithLimit method.
- Fixed weird IncrementalIndex logic that led to reading nulls for the timestamp.
- Adjusted weird StringFirst/Last logic that worked around the weird IncrementalIndex behavior.
- Refactored to share code between the four aggregators.
- Improved test coverage.
- Made the base stringFirst, stringLast aggregators adaptive, and streamlined the xFold versions into aliases. The adaptiveness is similar to how other aggregators like hyperUnique work.
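To make the isNull pitfall from the description concrete, here is a minimal sketch. The interfaces and classes below (`NumericSelector`, `ObjectSelector`, `StringBackedSelector`, `NullGuardDemo`) are hypothetical, simplified stand-ins written for illustration only; they are not the actual Druid selector interfaces, and the parsing behavior is an assumption.

```java
/** Hypothetical stand-in: isNull is only meaningful for the primitive getter. */
interface NumericSelector {
    boolean isNull();
    long getLong();
}

/** Hypothetical stand-in: a null value is reported by getObject returning null. */
interface ObjectSelector {
    Object getObject();
}

/** A numeric view built on top of a string column (assumed behavior for illustration). */
final class StringBackedSelector implements NumericSelector, ObjectSelector {
    private final String value;

    StringBackedSelector(String value) {
        this.value = value;
    }

    private Long parse() {
        if (value == null) {
            return null;
        }
        try {
            return Long.valueOf(value);
        } catch (NumberFormatException e) {
            return null; // not a number, so the numeric view is null
        }
    }

    @Override
    public boolean isNull() {
        return parse() == null;
    }

    @Override
    public long getLong() {
        Long parsed = parse();
        return parsed == null ? 0L : parsed;
    }

    @Override
    public Object getObject() {
        return value;
    }
}

final class NullGuardDemo {
    /** Improper: isNull describes the numeric view, so the string "abc" is lost here. */
    static String readGuardedByIsNull(StringBackedSelector selector) {
        return selector.isNull() ? null : (String) selector.getObject();
    }

    /** Proper: the object getter signals null through its own return value. */
    static String readObject(ObjectSelector selector) {
        return (String) selector.getObject();
    }

    /** Proper use of isNull: skip null rows instead of adding a spurious zero. */
    static long sum(long accumulator, NumericSelector selector) {
        return selector.isNull() ? accumulator : accumulator + selector.getLong();
    }
}
```

With a selector over the string "abc", `readGuardedByIsNull` returns null while `readObject` returns "abc", which is the kind of bug the description attributes to guarding getObject with isNull.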
08f6426 to fe34b51
This pull request introduces 1 alert when merging fe34b51 into b03aa06 - view on LGTM.com new alerts:
clintropolis approved these changes Nov 8, 2019
    } else if (object instanceof Float) {
      BloomKFilter.addFloat(buf, (float) object);
    } else if (object instanceof String) {
      BloomKFilter.addString(buf, (String) object);
Member
👍 This is a good change, I think, moving the DimensionSelector out of here. This doesn't need to handle List since it doesn't work at ingestion time, I think, and I can't think of anywhere else they would come from without column capabilities.
    if (selector instanceof NilColumnValueSelector) {
      return new NoopBloomFilterAggregator(maxNumEntries, onHeap);
    } else if (selector instanceof DimensionSelector) {
    /**
     * Encodes "string" into the buffer "byteBuffer", using no more than the number of bytes remaining in the buffer.
     * Will only encode whole characters. The byteBuffer's position and limit be changed during operation, but will
Member
nit: ... limit can be changed ... or ... may be ...?
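For readers following the javadoc under review, here is a rough sketch of how a whole-character byte limit can be enforced with the JDK's `CharsetEncoder`, combined with the -1 null marker mentioned in the commit message. This is an illustration under assumptions, not the code from this patch: the class `Utf8LimitSketch`, its method names, and the 4-byte length prefix layout are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; the layout is [int length][UTF-8 bytes], with -1 meaning null. */
final class Utf8LimitSketch {
    static final int NULL_LENGTH = -1;

    /** Encodes as many whole characters as fit in out.remaining(); returns bytes written. */
    static int encodeWithLimit(String string, ByteBuffer out) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        int start = out.position();
        // Stops with OVERFLOW once the next character's bytes no longer fit,
        // so no partial multi-byte sequence is left in the buffer.
        encoder.encode(CharBuffer.wrap(string), out, true);
        return out.position() - start;
    }

    /** Writes value at an absolute position, using at most maxBytes for the string body. */
    static void write(ByteBuffer buf, int position, String value, int maxBytes) {
        if (value == null) {
            buf.putInt(position, NULL_LENGTH); // null is distinct from the empty string (length 0)
            return;
        }
        ByteBuffer window = buf.duplicate(); // leaves buf's own position and limit untouched
        window.limit(position + Integer.BYTES + maxBytes);
        window.position(position + Integer.BYTES);
        buf.putInt(position, encodeWithLimit(value, window));
    }

    /** Reads back what write stored; returns null when the length marker is -1. */
    static String read(ByteBuffer buf, int position) {
        int length = buf.getInt(position);
        if (length == NULL_LENGTH) {
            return null;
        }
        byte[] bytes = new byte[length];
        ByteBuffer window = buf.duplicate();
        window.position(position + Integer.BYTES);
        window.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

In this sketch, writing "h\u00e9llo" with a 2-byte limit stores only "h": the second character needs two bytes and only one remains. By contrast, `substring(0, 2)` would keep two chars that encode to three bytes, which is how a char-based limit can overrun a byte budget.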