Skip to content

Fixes, adjustments to numeric null handling and string first/last aggregators.#8834

Merged
fjy merged 1 commit intoapache:masterfrom
gianm:fix-nulls-stringfirstlast
Nov 8, 2019
Merged

Fixes, adjustments to numeric null handling and string first/last aggregators.#8834
fjy merged 1 commit intoapache:masterfrom
gianm:fix-nulls-stringfirstlast

Conversation

@gianm
Copy link
Copy Markdown
Contributor

@gianm gianm commented Nov 6, 2019

There is a class of bugs due to the fact that BaseObjectColumnValueSelector
has both "getObject" and "isNull" methods, but in most selector implementations
and most call sites, it is clear that the intent of "isNull" is only to apply
to the primitive getters, not the object getter. This makes sense, because the
purpose of isNull is to enable detection of nulls in otherwise-primitive columns.
Imagine a string column with a numeric selector built on top of it. You would
want it to return isNull = true, so numeric aggregators don't treat it as
all zeroes.

Sometimes this design leads people to accidentally guard non-primitive get
methods with "selector.isNull" checks, which is improper.

This patch has three goals:

  1. Fix null-handling bugs that already exist in this class.
  2. Make interface and doc changes that reduce the probability of future bugs.
  3. Fix other, unrelated bugs I noticed in the stringFirst and stringLast
    aggregators while fixing null-handling bugs. I thought about splitting this
    into its own patch, but it ended up being tough to split from the
    null-handling fixes.

For (1) the fixes are,

  • Fix StringFirst and StringLastAggregatorFactory to stop guarding getObject
    calls on isNull, by no longer extending NullableAggregatorFactory. Now uses
    -1 as a sigil value for null, to differentiate nulls and empty strings.
  • Fix ExpressionFilter to stop guarding getObject calls on isNull. Also, use
    eval.asBoolean() to avoid calling getLong on the selector after already
    calling getObject.
  • Fix ObjectBloomFilterAggregator to stop guarding DimensionSelector calls
    on isNull. Also, refactored slightly to avoid the overhead of calling
    getObject followed by another getter (see BloomFilterAggregatorFactory for
    part of this).

For (2) the main changes are,

  • Remove the "isNull" method from BaseObjectColumnValueSelector.
  • Clarify "isNull" doc on BaseNullableColumnValueSelector.
  • Rename NullableAggregatorFactory -> NullbleNumericAggregatorFactory to emphasize
    that it only works on aggregators that take numbers as input.
  • Similar naming changes to the Aggregator, BufferAggregator, and AggregateCombiner.
  • Similar naming changes to helper methods for groupBy, ValueMatchers, etc.

For (3) the other fixes for StringFirst and StringLastAggregatorFactory are,

  • Fixed buffer overrun in the buffer aggregators when some characters in the string
    code into more than one byte (the old code used "substring" to apply a byte limit,
    which is bad). I did this by introducing a new StringUtils.toUtf8WithLimit method.
  • Fixed weird IncrementalIndex logic that led to reading nulls for the timestamp.
  • Adjusted weird StringFirst/Last logic that worked around the weird IncrementalIndex
    behavior.
  • Refactored to share code between the four aggregators.
  • Improved test coverage.
  • Made the base stringFirst, stringLast aggregators adaptive, and streamlined the
    xFold versions into aliases. The adaptiveness is similar to how other aggregators
    like hyperUnique work.

@gianm gianm added the Bug label Nov 6, 2019
@vogievetsky
Copy link
Copy Markdown
Contributor

Wow, that is a lot of changes. You basically rewrote them. 🚀

@fjy
Copy link
Copy Markdown
Contributor

fjy commented Nov 7, 2019

@gianm I think this is failing UT

@gianm gianm force-pushed the fix-nulls-stringfirstlast branch from 01fd1ab to d82c209 Compare November 7, 2019 18:56
@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Nov 7, 2019

I had to adjust the design somewhat to keep things working at ingest time. I added a new test for this too, to ensure it works. The branch and original description is now updated.

@gianm gianm force-pushed the fix-nulls-stringfirstlast branch 2 times, most recently from eca9447 to 08f6426 Compare November 7, 2019 20:27
…regators.

There is a class of bugs due to the fact that BaseObjectColumnValueSelector
has both "getObject" and "isNull" methods, but in most selector implementations
and most call sites, it is clear that the intent of "isNull" is only to apply
to the primitive getters, not the object getter. This makes sense, because the
purpose of isNull is to enable detection of nulls in otherwise-primitive columns.
Imagine a string column with a numeric selector built on top of it. You would
want it to return isNull = true, so numeric aggregators don't treat it as
all zeroes.

Sometimes this design leads people to accidentally guard non-primitive get
methods with "selector.isNull" checks, which is improper.

This patch has three goals:

1) Fix null-handling bugs that already exist in this class.
2) Make interface and doc changes that reduce the probability of future bugs.
3) Fix other, unrelated bugs I noticed in the stringFirst and stringLast
   aggregators while fixing null-handling bugs. I thought about splitting this
   into its own patch, but it ended up being tough to split from the
   null-handling fixes.

For (1) the fixes are,

- Fix StringFirst and StringLastAggregatorFactory to stop guarding getObject
  calls on isNull, by no longer extending NullableAggregatorFactory. Now uses
  -1 as a sigil value for null, to differentiate nulls and empty strings.
- Fix ExpressionFilter to stop guarding getObject calls on isNull. Also, use
  eval.asBoolean() to avoid calling getLong on the selector after already
  calling getObject.
- Fix ObjectBloomFilterAggregator to stop guarding DimensionSelector calls
  on isNull. Also, refactored slightly to avoid the overhead of calling
  getObject followed by another getter (see BloomFilterAggregatorFactory for
  part of this).

For (2) the main changes are,

- Remove the "isNull" method from BaseObjectColumnValueSelector.
- Clarify "isNull" doc on BaseNullableColumnValueSelector.
- Rename NullableAggregatorFactory -> NullbleNumericAggregatorFactory to emphasize
  that it only works on aggregators that take numbers as input.
- Similar naming changes to the Aggregator, BufferAggregator, and AggregateCombiner.
- Similar naming changes to helper methods for groupBy, ValueMatchers, etc.

For (3) the other fixes for StringFirst and StringLastAggregatorFactory are,

- Fixed buffer overrun in the buffer aggregators when some characters in the string
  code into more than one byte (the old code used "substring" to apply a byte limit,
  which is bad). I did this by introducing a new StringUtils.toUtf8WithLimit method.
- Fixed weird IncrementalIndex logic that led to reading nulls for the timestamp.
- Adjusted weird StringFirst/Last logic that worked around the weird IncrementalIndex
  behavior.
- Refactored to share code between the four aggregators.
- Improved test coverage.
- Made the base stringFirst, stringLast aggregators adaptive, and streamlined the
  xFold versions into aliases. The adaptiveness is similar to how other aggregators
  like hyperUnique work.
@gianm gianm force-pushed the fix-nulls-stringfirstlast branch from 08f6426 to fe34b51 Compare November 7, 2019 22:38
@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Nov 7, 2019

This pull request introduces 1 alert when merging fe34b51 into b03aa06 - view on LGTM.com

new alerts:

  • 1 for Useless null check

Copy link
Copy Markdown
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice 👍

} else if (object instanceof Float) {
BloomKFilter.addFloat(buf, (float) object);
} else if (object instanceof String) {
BloomKFilter.addString(buf, (String) object);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 this is a good change I think, moving the DimensionSelector out of here. This doesn't need to handle List since it doesn't work at ingestion time i think, and i can't think of anywhere else they would come from without a column capabilities.


if (selector instanceof NilColumnValueSelector) {
return new NoopBloomFilterAggregator(maxNumEntries, onHeap);
} else if (selector instanceof DimensionSelector) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


/**
* Encodes "string" into the buffer "byteBuffer", using no more than the number of bytes remaining in the buffer.
* Will only encode whole characters. The byteBuffer's position and limit be changed during operation, but will
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ... limit can be changed ... or ... may be ...?

@fjy fjy merged commit c204d68 into apache:master Nov 8, 2019
@gianm gianm added this to the 0.17.0 milestone Nov 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants