Attempt to coerce COMPLEX to number in numeric aggregators. #16564
gianm merged 4 commits into apache:master
PR apache#15371 eliminated ObjectColumnSelector's built-in implementations of numeric methods, which had been marked deprecated.

However, some complex types, like SpectatorHistogram, can be successfully coerced to number. The documentation for spectator histograms encourages taking advantage of this by aggregating complex columns with doubleSum and longSum. Currently, this doesn't work properly for IncrementalIndex, where the behavior relied on those deprecated ObjectColumnSelector methods.

This patch fixes the behavior by making two changes:

1) SimpleXYZAggregatorFactory (XYZ = type; base class for simple numeric aggregators; all of these extend NullableNumericAggregatorFactory) use getObject for STRING and COMPLEX. Previously, getObject was only used for STRING.
2) NullableNumericAggregatorFactory (base class for simple numeric aggregators) has a new protected method "useGetObject". This allows the base class to correctly check for null (using getObject or isNull).

The patch also adds a test for SpectatorHistogram + doubleSum + IncrementalIndex.
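The two changes described above can be illustrated with a minimal, self-contained sketch. It does not use Druid's real classes: SketchValueType, SketchSelector, HypotheticalHistogram, and CoercionSketch are hypothetical stand-ins, and the assumption that a coercible complex value behaves like a java.lang.Number is made only for illustration. The idea is that STRING and COMPLEX inputs are read through getObject and coerced, while primitive numeric inputs keep using isNull and getDouble.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT Druid's classes.
enum SketchValueType { LONG, DOUBLE, STRING, COMPLEX }

interface SketchSelector {
    Object getObject();
    double getDouble();
    boolean isNull();
}

// Assumption: a complex value that can be coerced exposes itself as a Number.
final class HypotheticalHistogram extends Number {
    private final long count;
    HypotheticalHistogram(long count) { this.count = count; }
    @Override public int intValue() { return (int) count; }
    @Override public long longValue() { return count; }
    @Override public float floatValue() { return count; }
    @Override public double doubleValue() { return count; }
}

final class CoercionSketch {
    // Loosely mirrors the "useGetObject" idea: STRING and COMPLEX go through getObject.
    static boolean useGetObject(SketchValueType type) {
        return type == SketchValueType.STRING || type == SketchValueType.COMPLEX;
    }

    // Best-effort coercion; returns null when the value is null or not coercible.
    static Double coerce(Object value) {
        if (value instanceof Number) {
            return ((Number) value).doubleValue();
        }
        if (value instanceof String) {
            try {
                return Double.parseDouble((String) value);
            } catch (NumberFormatException e) {
                return null;
            }
        }
        return null;
    }

    // An object-backed row whose numeric methods are unavailable.
    static SketchSelector objectRow(final Object value) {
        return new SketchSelector() {
            @Override public Object getObject() { return value; }
            @Override public double getDouble() { throw new UnsupportedOperationException("getDouble"); }
            @Override public boolean isNull() { throw new UnsupportedOperationException("isNull"); }
        };
    }

    // Sums rows, choosing the read path (and the null check) by column type.
    static double sum(SketchValueType type, List<SketchSelector> rows) {
        double total = 0;
        for (SketchSelector row : rows) {
            if (useGetObject(type)) {
                Double v = coerce(row.getObject()); // null check happens on the object
                if (v != null) {
                    total += v;
                }
            } else if (!row.isNull()) {             // primitive path uses isNull
                total += row.getDouble();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<SketchSelector> rows = Arrays.asList(
            objectRow(new HypotheticalHistogram(3)),
            objectRow(null),
            objectRow(new HypotheticalHistogram(4)));
        System.out.println(sum(SketchValueType.COMPLEX, rows));
    }
}
```

In this sketch, a COMPLEX column never touches getDouble or isNull, which is the property the IncrementalIndex case depends on once the object selector no longer implements the numeric methods.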
// Check timestamp
Assert.assertEquals(startOfDay.getMillis(), results.get(0).get(0));
// Check doubleSum
Assert.assertEquals(n * segments.size(), (Double) results.get(0).get(1), 0.001);
I think this should be just n regardless of how many segments there are. This should be the count of original input values, however they're spread across segments.
There happens to be a single segment here, so it doesn't affect the test.
Hmm, the * segments.size() is required for this test to pass (it's 2 here). This was the new test; I think what's happening in this test is the same data is added twice:
ImmutableList<Segment> segments = ImmutableList.of(
    new IncrementalIndexSegment(index, SegmentId.dummy("test")),
    helper.persistIncrementalIndex(index, null)
);
Since index has a full copy of the dataset, the query on segments sees a double-count.
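The arithmetic behind the double-count can be sketched in isolation. This is not the actual test: DoubleCountSketch and sumAcrossSegments are hypothetical names, and each row is assumed to contribute 1 to the sum. Querying two segments that each hold a full copy of the same n rows yields n * segments.size().

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of the Druid test suite.
final class DoubleCountSketch {
    // Sums every value in every segment, as a query spanning all segments would.
    static double sumAcrossSegments(List<double[]> segments) {
        double total = 0;
        for (double[] segment : segments) {
            for (double v : segment) {
                total += v;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[] rows = {1, 1, 1, 1, 1};           // n = 5 input values
        List<double[]> segments = Arrays.asList(
            rows,                                  // stands in for the in-memory segment
            rows.clone());                         // stands in for the persisted copy
        System.out.println(sumAcrossSegments(segments)); // 5 * 2 = 10.0
    }
}
```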
Oops, that's my bad then. Do we need both:
new IncrementalIndexSegment(index, SegmentId.dummy("test")), helper.persistIncrementalIndex(index, null)
for this test to be meaningful?
I think I'd intended to have just 1 incremental segment. Likely a copy/paste issue from another test.
I think it's fine to have both the in-memory and the persisted indexes in the test. This way we can test the case where the query hits both types of indexes.
public int getLength()
{
-  return index.size();
+  return -1;
Why do we want to claim a length of -1 here?
The method is specced as returning the serialized size of the column in bytes, or -1 if unknown. index.size() returns a row count, which doesn't match the specced behavior. The SpectatorHistogramIndexed doesn't seem to know its own serialized size, and the getLength() method doesn't seem to be used anywhere important, so I figured changing this to -1 was a good idea.
If it's not important then -1 seems fine.
We can very likely compute the size of the column (or estimate an upper bound on it) if that would optimize something elsewhere.
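For readers unfamiliar with the "-1 if unknown" convention discussed here, a caller that consumes such lengths might look like the following sketch. LengthSketch, UNKNOWN_LENGTH, and totalKnownBytes are hypothetical names, not Druid APIs; the only thing taken from the discussion is that -1 means the serialized size is unknown.

```java
// Hypothetical illustration of consuming a "size in bytes, or -1 if unknown" value.
final class LengthSketch {
    static final int UNKNOWN_LENGTH = -1;

    // Adds up only the lengths that are known, skipping unknown (-1) entries.
    static long totalKnownBytes(int[] lengths) {
        long total = 0;
        for (int length : lengths) {
            if (length != UNKNOWN_LENGTH) {
                total += length;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] lengths = {128, UNKNOWN_LENGTH, 64};
        System.out.println(totalKnownBytes(lengths));
    }
}
```

Returning a row count from such a method would silently inflate or deflate a total like this one, which is why reporting "unknown" is the safer choice when the real size is not available.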
maytasm left a comment
LGTM. Thanks for the fix!
Thanks for reviewing!
…6564)
* Coerce COMPLEX to number in numeric aggregators.
* Fix tests.
* Remove the special ColumnValueSelector.
* Add test.
The patch also adds a test for SpectatorHistogram + doubleSum + IncrementalIndex. Thanks @bsyk for the test, which I pulled from #16562.