Skip to content

Attempt to coerce COMPLEX to number in numeric aggregators.#16564

Merged
gianm merged 4 commits intoapache:masterfrom
gianm:aggs-coerce-complex
Jul 25, 2024
Merged

Attempt to coerce COMPLEX to number in numeric aggregators.#16564
gianm merged 4 commits intoapache:masterfrom
gianm:aggs-coerce-complex

Conversation

@gianm
Copy link
Copy Markdown
Contributor

@gianm gianm commented Jun 6, 2024

PR #15371 eliminated ObjectColumnSelector's built-in implementations of numeric methods, which had been marked deprecated.

However, some complex types, like SpectatorHistogram, can be successfully coerced to number. The documentation for spectator histograms encourages taking advantage of this by aggregating complex columns with doubleSum and longSum. Currently, this doesn't work properly for IncrementalIndex, where the behavior relied on those deprecated ObjectColumnSelector methods.

This patch fixes the behavior by making two changes:

  1. SimpleXYZAggregatorFactory (XYZ = type; base class for simple numeric aggregators;
    all of these extend NullableNumericAggregatorFactory) use getObject for STRING
    and COMPLEX. Previously, getObject was only used for STRING.

  2. NullableNumericAggregatorFactory (base class for simple numeric aggregators)
    has a new protected method "useGetObject". This allows the base class to
    correctly check for null (using getObject or isNull).

The patch also adds a test for SpectatorHistogram + doubleSum + IncrementalIndex. Thanks @bsyk for the test, which I pulled from #16562.

PR apache#15371 eliminated ObjectColumnSelector's built-in implementations of
numeric methods, which had been marked deprecated.

However, some complex types, like SpectatorHistogram, can be successfully coerced
to number. The documentation for spectator histograms encourages taking advantage of
this by aggregating complex columns with doubleSum and longSum. Currently, this
doesn't work properly for IncrementalIndex, where the behavior relied on those
deprecated ObjectColumnSelector methods.

This patch fixes the behavior by making two changes:

1) SimpleXYZAggregatorFactory (XYZ = type; base class for simple numeric aggregators;
   all of these extend NullableNumericAggregatorFactory) use getObject for STRING
   and COMPLEX. Previously, getObject was only used for STRING.

2) NullableNumericAggregatorFactory (base class for simple numeric aggregators)
   has a new protected method "useGetObject". This allows the base class to
   correctly check for null (using getObject or isNull).

The patch also adds a test for SpectatorHistogram + doubleSum + IncrementalIndex.
@gianm gianm added the Bug label Jun 6, 2024
@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Jun 6, 2024

@bsyk @maytasm note I changed the last assert from #16562 from:

Assert.assertEquals(n, (Double) results.get(0).get(1), 0.001);

to:

Assert.assertEquals(n * segments.size(), (Double) results.get(0).get(1), 0.001);

Please let me know if this looks right.

// Check timestamp
Assert.assertEquals(startOfDay.getMillis(), results.get(0).get(0));
// Check doubleSum
Assert.assertEquals(n * segments.size(), (Double) results.get(0).get(1), 0.001);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be just n regardless of how many segments there are. This should be the count of original input values, however they're spread across segments.
There happens to be a single segment here, so it doesn't affect the test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the * segments.size() is required for this test to pass (it's 2 here). This was the new test; I think what's happening in this test is the same data is added twice:

    ImmutableList<Segment> segments = ImmutableList.of(
        new IncrementalIndexSegment(index, SegmentId.dummy("test")),
        helper.persistIncrementalIndex(index, null)
    );

Since index has a full copy of the dataset, the query on segments see a double-count.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, that's my bad then. Do we need both:

new IncrementalIndexSegment(index, SegmentId.dummy("test")), helper.persistIncrementalIndex(index, null)

for this test to be meaningful?
I think I'd intended to have just 1 incremental segment. Likely a copy/paste issue from another test.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to have both the in-memory and the persisted indexes in the test. This way we can test when the query hit both type of indexes.

public int getLength()
{
return index.size();
return -1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to claim a length of -1 here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method is specced as returning the serialized size of the column in bytes, or -1 if unknown. index.size() returns a row count, which doesn't match the specced behavior. The SpectatorHistogramIndexed doesn't seem to know its own serialized size, and the getLength() method doesn't seem to be used anywhere important, so I figured changing this to -1 was a good idea.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not important then -1 seems fine.
We can very likely compute (or estimate an upper bound) of the size of the column if it will optimize something elsewhere.

Copy link
Copy Markdown
Contributor

@maytasm maytasm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the fix!

@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Jul 25, 2024

Thanks for reviewing!

@gianm gianm merged commit b2a88da into apache:master Jul 25, 2024
@gianm gianm deleted the aggs-coerce-complex branch July 25, 2024 15:45
sreemanamala pushed a commit to sreemanamala/druid that referenced this pull request Aug 6, 2024
…6564)

* Coerce COMPLEX to number in numeric aggregators.

PR apache#15371 eliminated ObjectColumnSelector's built-in implementations of
numeric methods, which had been marked deprecated.

However, some complex types, like SpectatorHistogram, can be successfully coerced
to number. The documentation for spectator histograms encourages taking advantage of
this by aggregating complex columns with doubleSum and longSum. Currently, this
doesn't work properly for IncrementalIndex, where the behavior relied on those
deprecated ObjectColumnSelector methods.

This patch fixes the behavior by making two changes:

1) SimpleXYZAggregatorFactory (XYZ = type; base class for simple numeric aggregators;
   all of these extend NullableNumericAggregatorFactory) use getObject for STRING
   and COMPLEX. Previously, getObject was only used for STRING.

2) NullableNumericAggregatorFactory (base class for simple numeric aggregators)
   has a new protected method "useGetObject". This allows the base class to
   correctly check for null (using getObject or isNull).

The patch also adds a test for SpectatorHistogram + doubleSum + IncrementalIndex.

* Fix tests.

* Remove the special ColumnValueSelector.

* Add test.
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants