Skip to content

Frame writers: Coerce numeric and array types in certain cases.#16994

Merged
gianm merged 9 commits intoapache:masterfrom
gianm:frame-typecast-selector
Sep 6, 2024
Merged

Frame writers: Coerce numeric and array types in certain cases.#16994
gianm merged 9 commits intoapache:masterfrom
gianm:frame-typecast-selector

Conversation

@gianm
Copy link
Copy Markdown
Contributor

@gianm gianm commented Sep 3, 2024

This patch adds TypeCastSelectors, which is used when writing frames to perform two coercions:

  • When a numeric type is desired and the underlying type is non-numeric or unknown, the underlying selector is wrapped, getObject is called and the result is coerced using ExprEval.ofType. This differs from the prior behavior where the primitive methods like getLong, getDouble, etc, would be called directly. This fixes an issue where a column would be read as all-zeroes when its SQL type is numeric and its physical type is string, which can happen when evolving a column's type from string to number.

  • When an array type is desired, the underlying selector is wrapped, getObject is called, and the result is coerced to Object[]. This coercion replaces some earlier logic from Numeric array support for columnar frames #15917.

This mirrors similar logic in numeric aggregators. (The same method
from AggregatorUtil is even used to determine when to apply the logic.)

The idea is that when an underlying selector is STRING or COMPLEX typed,
we should call getObject and cast the result to number, rather than using
the primitive numeric accessor methods.

This fixes an issue where a column would be read as all-zeroes when its
SQL type is numeric, and its physical type is string. This can happen when
evolving a column's type from string to number.
@gianm gianm added Bug Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Sep 3, 2024
public double getDouble()
{
final Number n = computeIfNeeded();
return n == null ? NullHandling.ZERO_DOUBLE : n.floatValue();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this call n.doubleValue()?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured since this is the "to float" selector, it should use float precision even if the caller is requesting a double.

Fwiw, the caller really shouldn't be requesting a double anyway (since it's a "to float" selector). I have tests for this method but I don't expect it to be called.

return ((Number) obj).doubleValue();
} else {
final ExprEval<?> eval = ExprEval.bestEffortOf(obj);
return eval.isNumericNull() ? null : eval.asDouble();
Copy link
Copy Markdown
Contributor

@cryptoe cryptoe Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we through a parse exception here ?

Going though the exprEval code, it seems that if a string expression eval is returned and its not a valid double, then a null is returned.
Shoudn't we throw an exception and force the user to add a cast to is sql statement ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion with @clintropolis , the automatic type conversion is used all over the processing stack and the ingestion stack. Hence, parseException may not be correct.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this sort of coercion behavior is standard in the query stack when reading from selectors that don't match the desired type. It's how we accomplish schema evolution.


switch (desiredType.getType()) {
case LONG:
return new ObjectToLongColumnValueSelector(selector, rowIdSupplier);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be a little bit nicer to get the column capabilities from the selector factory so we know the 'input' column type and pass that into these ObjectToType wrapper selectors so that they could use ExprEval.ofType instead of ExprEval.bestEffortOf which should be a bit more efficient since the latter is mostly these days reserved for uses where we do not know the input type

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, I'll change it.

@gianm gianm changed the title Frame writers: use getObject when reading string/complex types. Frame writers: cast when reading numbers/arrays from selectors of different underlying type. Sep 4, 2024
@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Sep 4, 2024

Updated PR to:

  • Use ExprEval.ofType rather than bestEffortOf
  • Move array typecasting logic into TypeCastSelectors, instead of the field writers.
  • Update an IT that was failing because a double_col: 0 turned into a double_col: null when double_col was actually stored as STRING with value extra. This is expected.

{
final Object obj = selector.getObject();
if (obj == null) {
return ExprEval.of(null);
Copy link
Copy Markdown
Member

@clintropolis clintropolis Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could use ExprEval.ofType(selectorType, null) so it doesn't become a ExpressionType.STRING, but also probably doesn't matter

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still does become a string if we do that, b/c of how the code flow works through ExprEval.ofType. Maybe it shouldn't? But it does.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if type is null it becomes a string, but if the value is null it is an ExprEval of that type with a null value

@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Sep 4, 2024

I had to expand the scope of this patch to fix a failing test case in FrameBasedInlineDataSourceSerializerTest. This part is added:

This patch also fixes up selectors RowBasedColumnSelectorFactory#makeColumnValueSelector to adhere to the contract of BaseObjectColumnValueSelector#getObject. In particular, the column selector factory now enforces the requirement that Number is returned for columns of numeric type, and Object[] is returned for selectors of array type. This is connected to the above change, because frame writers now rely on this contract being adhered to.

@gianm gianm changed the title Frame writers: cast when reading numbers/arrays from selectors of different underlying type. Additional object coercion in FrameWriters, RowBasedColumnSelectorFactory. Sep 4, 2024
@gianm gianm changed the title Additional object coercion in FrameWriters, RowBasedColumnSelectorFactory. FrameWriters: Coerce numeric and array types in certain cases. Sep 5, 2024
@gianm gianm changed the title FrameWriters: Coerce numeric and array types in certain cases. Frame writers: Coerce numeric and array types in certain cases. Sep 5, 2024
@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Sep 5, 2024

I just pushed a new approach that no longer modifies RowBasedColumnSelectorFactory. The blast radius of that change was a little more than I bargained for. Instead, I updated TypeCastSelectors to apply the array coercion always, even if the underlying selector claims to offer the same array type that we want. I do think it'd be good to update RowBasedColumnSelectorFactory at some point though, since it's currently used in ways that violate the BaseObjectColumnValueSelector#getObject contract. (in particular, when the ColumnInspector declares type ARRAY<X> for a column, but the RowAdapter does not provide Object[].)

When merging this patch, please use the PR title + description for the commit message. Letting GitHub use the list of commits for the message is going to be too confusing, since the approach changed a bunch of times.

} else if (type == ValueType.ARRAY) {
final TypeSignature<ValueType> elementType = desiredType.getElementType();

if (obj instanceof List) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could use ExprEval.bestEffortArray here i suppose, though ExprEval.ofType if desiredType is an array, as well as ExprEval.bestEffortOf should have pretty similar logic.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #16994 (comment), I pushed a new commit that replaces this method with ExprEval.ofType.

* @param desiredType desired type
*/
@Nullable
public static Object bestEffortCoerce(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method seems odd to me, it is basicaly like ExprEval.ofType which separates handling based on the expected ExpressionType (instead of ColumnType like is used here), but then uses ExprEval.bestEffortOf inside of the cases (which uses a much larger set of instanceof checks than ExprEval.ofType).

What sets this method apart from the others?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a new commit that replaces this method with ExprEval.ofType. It seems to do what we need. I didn't try it at first, because I misunderstood what it does— I didn't realize it also included coercion. I added javadoc to ExprEval.ofType explaining that it does do coercion.

Comment on lines +522 to +527
/**
* Create an eval of the provided type. Coerces the provided object to the desired type.
*
* @param type type, or null to be equivalent to {@link #bestEffortOf(Object)}
* @param value object to be coerced to the type
*/
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@gianm gianm merged commit 175636b into apache:master Sep 6, 2024
@gianm gianm deleted the frame-typecast-selector branch September 6, 2024 00:20
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Segment Format and Ser/De Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants