
Conversation

@jorgecarleitao
Member

@jorgecarleitao jorgecarleitao commented Jul 18, 2020

This PR relates to ARROW-9516.

It:

  1. simplifies how we construct and handle columns: all columns are now name-based, not index-based. This simplifies our code base significantly.
  2. makes all column naming happen on the logical plan, not physical plan
  3. gives more expressive column names in the final schema of a logical plan, particularly for aggregated expressions (e.g. SUM(a) and SUM(b) instead of SUM and SUM).

This is currently a proof of value: all tests pass, but no decision has been made on whether we should proceed with these changes. More details are available at ARROW-9516.
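
As a rough illustration of the direction (a hypothetical sketch only; the enum, function, and names below are illustrative, not DataFusion's actual types): a name-based column reference is resolved against whatever schema the operator receives, whereas an index-based reference just points at a slot and must be kept consistent across every plan rewrite.

```rust
// Hypothetical sketch only; not DataFusion's API.
enum Expr {
    /// Index-based reference (the old approach): fragile across projections.
    ColumnIndex(usize),
    /// Name-based reference (this PR's approach): resolved against the input schema.
    Column(String),
}

fn resolve(expr: &Expr, schema: &[&str]) -> Option<usize> {
    match expr {
        Expr::ColumnIndex(i) => schema.get(*i).map(|_| *i),
        Expr::Column(name) => schema.iter().position(|f| *f == name.as_str()),
    }
}

fn main() {
    let schema = ["state", "salary"];
    // A name-based reference keeps its meaning even if columns are reordered.
    assert_eq!(resolve(&Expr::Column("salary".into()), &schema), Some(1));
    // An index-based reference points at whatever happens to sit in that slot.
    assert_eq!(resolve(&Expr::ColumnIndex(0), &schema), Some(0));
}
```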

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

Member Author

@jorgecarleitao jorgecarleitao Jul 18, 2020


This change was necessary because the file's schema is c1, c2, not state, salary. We were able to get away with this before because the names did not mean anything: we could read a file with a given schema using another schema's field names, as long as the indexing was correct. I believe that this should not be possible by design, as it introduces situations that are non-trivial to debug.
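
A minimal sketch of the failure mode this prevents (hypothetical code, not the arrow crate's API): with name-based resolution, reading a file whose schema is c1, c2 under the names state, salary fails loudly instead of silently returning the wrong columns.

```rust
// Hypothetical sketch; not the arrow crate's Schema type.
fn index_of(schema: &[&str], name: &str) -> Result<usize, String> {
    schema
        .iter()
        .position(|f| *f == name)
        .ok_or_else(|| format!("column '{}' not found in schema {:?}", name, schema))
}

fn main() {
    let file_schema = ["c1", "c2"];
    // Name-based resolution surfaces the mismatch immediately...
    assert!(index_of(&file_schema, "state").is_err());
    // ...whereas index-based access would have silently read c1 in place of "state".
    assert_eq!(index_of(&file_schema, "c1"), Ok(0));
}
```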

Member Author


This was already the case - I am just making it explicit in this comment.

@jorgecarleitao jorgecarleitao changed the title from "WIP: refactor of column names" to "ARROW-9516: [Rust][DataFusion] refactor of column names" on Jul 18, 2020

@jorgecarleitao
Member Author

@andygrove fyi

Member


It looks like we're losing the alias name here?

Member


So we just need to fix this so that the physical expression is created using the aliased name.

Member Author


No, for the same reason: the physical plan does not care about names anymore, and aliases are only about naming.

Member Author

@jorgecarleitao jorgecarleitao Jul 22, 2020


fyi my comment was written before I read your second comment;

these pages are definitely not thread safe!

Member


Ok, so the schema for these physical operators is based on the logical schema? I guess I just want to be sure that the alias name is preserved in the query results.

A unit test demonstrating this would be good.

Member Author

@jorgecarleitao jorgecarleitao Jul 22, 2020


(I wrote my other comment because I became uncertain about my claim and wanted to double check ^_^: yes, good to have this triple checked; this is a big change)

Let's see:

  1. the schema of the logical plan is built from the columns' names, as per this line, and that schema is then passed to the physical plan, as per the line in this discussion. So, it should be the case...
  2. line 1089 of context.rs IMO demonstrates this: we alias an expression in the logical plan, and the physical plan's schema field for that column is the alias itself (see the sketch below).
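
A toy sketch of the property being discussed (hypothetical types, not DataFusion code): the alias is applied when the logical plan names its output field, and the physical plan simply inherits that schema, so the alias shows up in the query results.

```rust
// Illustrative model only; Field and the naming helper are hypothetical.
struct Field {
    name: String,
}

/// Build the output field for an expression, applying the alias (if any) at naming time.
fn output_field(default_name: &str, alias: Option<&str>) -> Field {
    Field {
        name: alias.unwrap_or(default_name).to_string(),
    }
}

fn main() {
    // Logical plan: SELECT SUM(c12) AS total FROM t
    let field = output_field("SUM(c12)", Some("total"));
    // The physical plan inherits this schema, so the results expose the alias.
    assert_eq!(field.name, "total");
    // Without an alias, the expression's own name is used.
    assert_eq!(output_field("SUM(c12)", None).name, "SUM(c12)");
}
```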

Member


Thanks. I just pulled your branch and modified one of the integration tests just to be sure, and it looks good. This really helps simplify some things, thanks!

Could you rebase?

Member Author


:) perfect. I have rebased this against the latest master.

@andygrove
Member

andygrove commented Jul 22, 2020

This is looking good overall. I think we need to keep the Alias in the logical plan though.

@andygrove andygrove requested review from nevi-me and paddyhoran July 22, 2020 13:51
@jorgecarleitao
Member Author

jorgecarleitao commented Jul 23, 2020

The test csv_query_group_by_avg_with_projection is now failing (after rebase).

The query is SELECT avg(c12), c1 FROM aggregate_test_100 GROUP BY c1 and the error is that the order of the result is wrong: it is c1, avg(c12), not avg(c12), c1. I am looking into it.

EDIT 1: The root cause is that projection_expr is not passed to the LogicalPlanBuilder from the SQL planner, and thus there is no way for the LogicalPlanBuilder to know which order to apply when building the schema (it uses the order: groups first, aggregations second).
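
A small sketch of the fix implied here (hypothetical code, under the assumption stated above): since the aggregate emits group expressions first and aggregations second, a trailing projection is needed to map its output back to the order of the SELECT list.

```rust
// Hypothetical sketch of why a projection is needed after the aggregate.
fn main() {
    // The aggregate's output order: group expressions first, then aggregations.
    let aggregate_output = ["c1", "avg(c12)"];
    // The order requested by the SELECT list: avg(c12), c1.
    let select_order = ["avg(c12)", "c1"];
    // A trailing projection maps the aggregate's output back to the SELECT order.
    let projection: Vec<usize> = select_order
        .iter()
        .map(|name| aggregate_output.iter().position(|c| c == name).unwrap())
        .collect();
    assert_eq!(projection, vec![1, 0]);
}
```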

@jorgecarleitao
Member Author

@andygrove, this is IMO now ready. There were two errors, which I fixed in two separate commits:

  • 3503c0d fixed an error caused by a mistake of mine
  • ff3ade9 fixed an error in one of the tests

The first was related to the need to add a projection when the order of the SELECT is different from [groups] + [aggregates] in a GROUP BY. The current code has this, and I had removed that part by mistake in this PR. Also, since this was only being tested in the integration tests, I have added a new test to the library that validates the logical plan.

The second issue is IMO an error in one of the tests: it was parsing SELECT c2, MIN(c12), MAX(c12) FROM aggregate_test_100 GROUP BY c1 as a valid statement and interpreting c2 as representing c1. With this PR, that is identified as an error, because c2 is not a valid column name in this context.
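
A simplified, hypothetical sketch of the kind of check that now rejects that query. In this PR the rejection actually falls out of name-based column resolution rather than an explicit rule, but the effect is the same: a non-aggregated column in the SELECT list must appear in the GROUP BY clause, so c2 with GROUP BY c1 is an error.

```rust
// Hypothetical, simplified check; the real behavior comes from name resolution
// during logical planning, not from a function like this.
fn validate_group_by(non_aggregated: &[&str], group_by: &[&str]) -> Result<(), String> {
    for col in non_aggregated {
        if !group_by.contains(col) {
            return Err(format!("column '{}' is not in the GROUP BY clause", col));
        }
    }
    Ok(())
}

fn main() {
    // SELECT c2, MIN(c12), MAX(c12) FROM aggregate_test_100 GROUP BY c1
    // is rejected: the non-aggregated column c2 is not grouped by.
    assert!(validate_group_by(&["c2"], &["c1"]).is_err());
    // SELECT c1, MIN(c12), MAX(c12) ... GROUP BY c1 would be accepted.
    assert!(validate_group_by(&["c1"], &["c1"]).is_ok());
}
```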

@andygrove
Member

@jorgecarleitao Thanks. It looks like you will have to rebase one more time because master was rebased as part of the 1.0.0 release process. I can merge this once that is done.

@jorgecarleitao
Member Author

@andygrove done. Thanks a lot for the effort and help!

Member

@andygrove andygrove left a comment


LGTM!

@andygrove
Member

@jorgecarleitao The parquet version fix is merged, so one more rebase should get the tests passing.

Columns are no longer identified by their index, but by their name.

This is induced by the following assumption: every table that
we scan has unique column names.

This greatly simplifies the code and the public API of physical
and logical plans. It also greatly simplifies the projection
push-down and deprecates ResolveColumns, which is no longer needed.
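
A tiny sketch of the assumption stated above (hypothetical code, not part of this commit): column names within a scanned table must be unique, otherwise name-based resolution cannot disambiguate them.

```rust
use std::collections::HashSet;

// Hypothetical check for the uniqueness assumption.
fn has_unique_column_names(columns: &[&str]) -> bool {
    let mut seen = HashSet::new();
    columns.iter().all(|c| seen.insert(*c))
}

fn main() {
    assert!(has_unique_column_names(&["c1", "c2", "c3"]));
    // A duplicate name would make name-based lookup ambiguous.
    assert!(!has_unique_column_names(&["c1", "c1"]));
}
```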
@jorgecarleitao
Member Author

rebased
