ARROW-9809: [Rust][DataFusion] Fixed type coercion, supertypes and type checking. #8024

jorgecarleitao · 2020-08-22T15:51:31Z

This commit moves all type coercion from the logical plan to the physical plan and fixes the supertype function. As a result, field names no longer change due to coercion rules, and we get better control over how coercion supports physical computations, among other benefits.

This commit also makes it clearer how we enforce type checking during planning: the logical plan now knows how to derive its schema directly from binary expressions, even before coercion is applied.

The rationale for this change is that coercions are simplifications of a physical computation (it is easier to sum two numbers of the same type at the hardware level).

This partially solves ARROW-9809 (for binary expressions, not for UDFs), an issue in which the physical schema could be modified by coercion rules, causing the RecordBatch's schema to differ from the logical schema.

This also addresses some inconsistencies in how we coerced certain types for binary operators, so that such cases now error during planning instead of during execution.
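To make the planning-time check concrete, here is a minimal sketch of the idea. It is not the code in this PR: `DataType` and `Operator` are small hypothetical stand-ins for the real enums, `coerce_binary` and `numeric_supertype` are illustrative names, and the widening rules are assumptions chosen for brevity rather than DataFusion's actual rules.

```rust
/// Hypothetical, reduced stand-in for Arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    UInt16,
    UInt32,
    Int64,
    Float64,
    Utf8,
}

/// Hypothetical, reduced set of binary operators.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Operator {
    Plus,
    Lt,
    Like,
}

fn is_numeric(t: DataType) -> bool {
    matches!(
        t,
        DataType::UInt16 | DataType::UInt32 | DataType::Int64 | DataType::Float64
    )
}

/// Smallest type (in this sketch) that can represent both numeric inputs.
fn numeric_supertype(lhs: DataType, rhs: DataType) -> Option<DataType> {
    use DataType::*;
    if !is_numeric(lhs) || !is_numeric(rhs) {
        return None;
    }
    match (lhs, rhs) {
        (a, b) if a == b => Some(a),
        (Float64, _) | (_, Float64) => Some(Float64),
        (UInt16, UInt32) | (UInt32, UInt16) => Some(UInt32),
        // Remaining pairs mix an unsigned type with Int64 (assumed rule).
        _ => Some(Int64),
    }
}

/// Returns the common type both operands are cast to before evaluation,
/// or an error that can be surfaced while planning.
pub fn coerce_binary(
    lhs: DataType,
    op: Operator,
    rhs: DataType,
) -> Result<DataType, String> {
    let common = match op {
        // In this sketch LIKE is only defined for strings.
        Operator::Like => match (lhs, rhs) {
            (DataType::Utf8, DataType::Utf8) => Some(DataType::Utf8),
            _ => None,
        },
        Operator::Plus | Operator::Lt => numeric_supertype(lhs, rhs),
    };
    common.ok_or_else(|| {
        format!("no common type to coerce '{:?} {:?} {:?}'", lhs, op, rhs)
    })
}
```

With these assumed rules, `coerce_binary(DataType::UInt16, Operator::Plus, DataType::UInt32)` yields `Ok(DataType::UInt32)`, while `Utf8 + Int64` is rejected before any batch is processed.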

This also introduces a significant number of tests of the overall consistency of binary operators: it is now explicit what types they expect and how coercion happens for each operator. It also adds tests to different parts of the physical execution to ensure schema consistency for binary operators, including negative tests (cases that should error).

This also makes like and nlike generally available, and adds some tests for them.

This closes ARROW-4957.

@andygrove and @alamb, I am really sorry for this long commit, but I was unable to split it into smaller parts with passing tests. There was a strong coupling between get_supertype and the physical expressions that made it hard to work this through.

github-actions · 2020-08-22T16:05:49Z

https://issues.apache.org/jira/browse/ARROW-9809

rust/datafusion/src/execution/physical_plan/expressions.rs

andygrove

This is a fantastic improvement! Thanks @jorgecarleitao

This commit moves all type coercion from the logical plan to the physical plan, so that field names no longer change due to coercion rules. The rationale for this change is that coercions are simplifications of a physical computation (it is easier to sum two numbers of the same type at the hardware level). This commit essentially frees the logical plan from worrying about type coercion; it only needs to know the resulting type of the operator. This also addresses an issue in which the physical schema could be modified by coercion rules, causing the RecordBatch's schema to differ from the logical schema. This also addresses some inconsistencies in how we coerced certain types for binary operators, so that such cases now error during planning instead of during execution. This closes ARROW-9809 and ARROW-4957.

alamb

I can't say I totally follow all this code and I didn't study the diff all that carefully, but I also don't have any opposition to merging this. The PR also increases test coverage, so I say

In general, it sounds a little strange to me architecturally to postpone type coercion (also known as type resolution) to physical planning, as I think the information is useful during logical planning. For example, it matters for partial evaluation: to turn A < 5 OR A = 5 into A <= 5, I think you need to know the actual types of A and 5.

However, since most of the logic operates on DataType, which is shared between logical and physical plans, I think we can always move where exactly the code is executed (and maybe even run it in both places).
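As an illustration of the kind of rewrite meant here, the following is a rough sketch, not DataFusion code. `Expr` is a hypothetical, heavily reduced expression enum and `simplify` is an illustrative name. The rule only fires when both comparisons have structurally identical operands, which in this sketch implies identical types because the literal variant carries its type.

```rust
/// Hypothetical, reduced logical expression type.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    /// A literal that carries its type (64-bit signed integer).
    Int64(i64),
    Lt(Box<Expr>, Box<Expr>),
    LtEq(Box<Expr>, Box<Expr>),
    Eq(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

/// Rewrites `x < y OR x = y` into `x <= y`; any other expression is
/// returned unchanged.
pub fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::Or(left, right) => match (*left, *right) {
            // Both sides must compare exactly the same operands.
            (Expr::Lt(a, b), Expr::Eq(c, d)) if a == c && b == d => {
                Expr::LtEq(a, b)
            }
            (l, r) => Expr::Or(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}
```

If the two literals had different types (say, a string '5' on one side), the operands would not compare equal and the disjunction would be left as is, which is why the types need to be known at this stage.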

jorgecarleitao · 2020-08-23T12:48:05Z

@alamb , thanks a lot for that insight.

I may have been using the wrong notation here.

I think that we have each column's type during logical planning: the LogicalPlanBuilder always starts with a scan with a well-defined (or inferred via scan) schema. When a projection is constructed, which requires us to derive a schema, we build that schema by deriving the column types from its expressions via exprlist_to_fields (which uses Expr::to_field, which uses Expr::get_type(input_schema)).
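Roughly, the derivation chain looks like the sketch below. The function names mirror the ones mentioned above, but the types (`DataType`, `Field`, `Expr`), the signatures, and the result-type rules are simplified assumptions for illustration, not the actual DataFusion definitions.

```rust
/// Hypothetical, reduced stand-in for Arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    Int32,
    Int64,
    Boolean,
}

/// Hypothetical, reduced stand-in for a schema field.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
}

/// Hypothetical, reduced logical expression.
pub enum Expr {
    Column(String),
    Plus(Box<Expr>, Box<Expr>),
    Lt(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Name of the resulting field; no casts appear in it.
    pub fn name(&self) -> String {
        match self {
            Expr::Column(n) => n.clone(),
            Expr::Plus(l, r) => format!("{} + {}", l.name(), r.name()),
            Expr::Lt(l, r) => format!("{} < {}", l.name(), r.name()),
        }
    }

    /// Resulting type of the expression against the input schema.
    pub fn get_type(&self, schema: &[Field]) -> Result<DataType, String> {
        match self {
            Expr::Column(name) => schema
                .iter()
                .find(|f| &f.name == name)
                .map(|f| f.data_type)
                .ok_or_else(|| format!("no column named '{}'", name)),
            Expr::Plus(l, r) => match (l.get_type(schema)?, r.get_type(schema)?) {
                (DataType::Boolean, _) | (_, DataType::Boolean) => {
                    Err(format!("cannot evaluate '{}' on booleans", self.name()))
                }
                (DataType::Int32, DataType::Int32) => Ok(DataType::Int32),
                // Any remaining pair involves Int64 (assumed widening rule).
                _ => Ok(DataType::Int64),
            },
            Expr::Lt(l, r) => {
                // Operands must resolve; a comparison yields a boolean.
                l.get_type(schema)?;
                r.get_type(schema)?;
                Ok(DataType::Boolean)
            }
        }
    }

    pub fn to_field(&self, schema: &[Field]) -> Result<Field, String> {
        Ok(Field {
            name: self.name(),
            data_type: self.get_type(schema)?,
        })
    }
}

/// Derives the output fields of a projection from its expressions.
pub fn exprlist_to_fields(exprs: &[Expr], schema: &[Field]) -> Result<Vec<Field>, String> {
    exprs.iter().map(|e| e.to_field(schema)).collect()
}
```

In this sketch, projecting `a + b` over a schema with `a: Int32` and `b: Int64` gives a field named `a + b` of type `Int64`, with no cast in the field name.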

As I see it, the type coercer optimizer is casting types being passed to binary operators for the sole purpose of matching numerical types to perform computations, as we do not have kernels for different numerical types (e.g. u16 + u32).
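For example, a u16 + u32 addition can be served by a single same-type kernel once one side is cast. The sketch below uses plain slices instead of Arrow arrays, and all function names are hypothetical. Reporting overflow as `None` is a choice made for this sketch, not a statement about how the Arrow kernels behave.

```rust
/// Cast step inserted by coercion: widening u16 to u32 is lossless.
pub fn cast_u16_to_u32(values: &[u16]) -> Vec<u32> {
    values.iter().map(|&v| u32::from(v)).collect()
}

/// The only addition kernel in this sketch: both inputs share one type.
/// Overflow is reported as `None` (an assumption made for this sketch).
pub fn add_u32(lhs: &[u32], rhs: &[u32]) -> Vec<Option<u32>> {
    lhs.iter()
        .zip(rhs)
        .map(|(&a, &b)| a.checked_add(b))
        .collect()
}

/// u16 + u32 expressed as "cast, then reuse the u32 kernel".
pub fn add_u16_u32(lhs: &[u16], rhs: &[u32]) -> Vec<Option<u32>> {
    add_u32(&cast_u16_to_u32(lhs), rhs)
}
```

The cast only exists to reach a kernel that is available, so it does not need to be visible in the logical plan or in the resulting field names.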

andygrove added Component: Rust Component: Rust - DataFusion labels Aug 22, 2020

andygrove reviewed Aug 22, 2020

andygrove approved these changes Aug 22, 2020

jorgecarleitao mentioned this pull request Aug 23, 2020

ARROW-9751: [Rust] [DataFusion] Allow UDFs to accept multiple data types per argument #7967

Closed

alamb reviewed Aug 23, 2020

andygrove closed this in 735c870 Aug 23, 2020

jorgecarleitao deleted the fix_types branch August 23, 2020 21:46

alamb mentioned this pull request Nov 26, 2021

The framework about expression type coercion apache/datafusion#1356

Closed

7 tasks

asfimport mentioned this pull request Sep 12, 2020

[Rust] [DataFusion] logical schema = physical schema is not true #25852

Closed
