
Conversation

@jorgecarleitao
Member

@jorgecarleitao jorgecarleitao commented Jul 18, 2020

This PR relates to ARROW-9516.

It:

  1. simplifies how we construct and handle columns: all columns are now name-based, not index-based. This simplifies our code base significantly.
  2. makes all column naming happen on the logical plan, not physical plan
  3. gives more expressive column names in the final schema of a logical plan, particularly for aggregated expressions (e.g. SUM(a) and SUM(b) instead of SUM and SUM).

This is currently a proof of value: all tests pass, but no decision has been made on whether we should proceed with these changes. More details are available at ARROW-9516.
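
As a rough illustration of the direction (a hypothetical sketch only; the enum, function, and names below are illustrative, not DataFusion's actual types): a name-based column reference is resolved against whatever schema the operator receives, whereas an index-based reference just points at a slot and must be kept consistent across every plan rewrite.

```rust
// Hypothetical sketch only; not DataFusion's API.
enum Expr {
    /// Index-based reference (the old approach): fragile across projections.
    ColumnIndex(usize),
    /// Name-based reference (this PR's approach): resolved against the input schema.
    Column(String),
}

fn resolve(expr: &Expr, schema: &[&str]) -> Option<usize> {
    match expr {
        Expr::ColumnIndex(i) => schema.get(*i).map(|_| *i),
        Expr::Column(name) => schema.iter().position(|f| *f == name.as_str()),
    }
}

fn main() {
    let schema = ["state", "salary"];
    // A name-based reference keeps its meaning even if columns are reordered.
    assert_eq!(resolve(&Expr::Column("salary".into()), &schema), Some(1));
    // An index-based reference points at whatever happens to sit in that slot.
    assert_eq!(resolve(&Expr::ColumnIndex(0), &schema), Some(0));
}
```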

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

Member Author

@jorgecarleitao jorgecarleitao Jul 18, 2020


This change was necessary because the file's schema is c1, c2, not state, salary. We were able to get away with this before because the names did not mean anything: we could read a file with a given schema using another schema's field names, as long as the indexing was correct. I believe that this should not be possible by design, as it introduces situations that are non-trivial to debug.
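
A minimal sketch of the failure mode this prevents (hypothetical code, not the arrow crate's API): with name-based resolution, reading a file whose schema is c1, c2 under the names state, salary fails loudly instead of silently returning the wrong columns.

```rust
// Hypothetical sketch; not the arrow crate's Schema type.
fn index_of(schema: &[&str], name: &str) -> Result<usize, String> {
    schema
        .iter()
        .position(|f| *f == name)
        .ok_or_else(|| format!("column '{}' not found in schema {:?}", name, schema))
}

fn main() {
    let file_schema = ["c1", "c2"];
    // Name-based resolution surfaces the mismatch immediately...
    assert!(index_of(&file_schema, "state").is_err());
    // ...whereas index-based access would have silently read c1 in place of "state".
    assert_eq!(index_of(&file_schema, "c1"), Ok(0));
}
```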

Member Author


This was already the case - I am just making it explicit in this comment.

@jorgecarleitao jorgecarleitao changed the title from "WIP: refactor of column names" to "ARROW-9516: [Rust][DataFusion] refactor of column names" on Jul 18, 2020

@jorgecarleitao
Member Author

@andygrove fyi

Member


It looks like we're losing the alias name here?

Member


So we just need to fix this so that the physical expression is created using the aliased name.

Member Author


No, for the same reason: the physical plan does not care about names anymore, and aliases are only about naming.

Member Author

@jorgecarleitao jorgecarleitao Jul 22, 2020


fyi my comment was written before I read your second comment;

these pages are definitely not thread safe!

Member


Ok, so the schema for these physical operators is based on the logical schema? I guess I just want to be sure that the alias name is preserved in the query results.

A unit test demonstrating this would be good.

Member Author

@jorgecarleitao jorgecarleitao Jul 22, 2020


(I wrote my other comment because I became uncertain about my claim and wanted to double check ^_^: yes, good to have this triple checked; this is a big change)

Let's see:

  1. the schema of the logical plan is built from the columns' names, as per this line, and that schema is then passed to the physical plan, as per the line in this discussion. So, it should be the case...
  2. line 1089 of context.rs IMO demonstrates this: we alias an expression in the logical plan, and the physical plan's schema field for that column is the alias itself (see the sketch below).
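
A toy sketch of the property being discussed (hypothetical types, not DataFusion code): the alias is applied when the logical plan names its output field, and the physical plan simply inherits that schema, so the alias shows up in the query results.

```rust
// Illustrative model only; Field and the naming helper are hypothetical.
struct Field {
    name: String,
}

/// Build the output field for an expression, applying the alias (if any) at naming time.
fn output_field(default_name: &str, alias: Option<&str>) -> Field {
    Field {
        name: alias.unwrap_or(default_name).to_string(),
    }
}

fn main() {
    // Logical plan: SELECT SUM(c12) AS total FROM t
    let field = output_field("SUM(c12)", Some("total"));
    // The physical plan inherits this schema, so the results expose the alias.
    assert_eq!(field.name, "total");
    // Without an alias, the expression's own name is used.
    assert_eq!(output_field("SUM(c12)", None).name, "SUM(c12)");
}
```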

Member


Thanks. I just pulled your branch and modified one of the integration tests just to be sure, and it looks good. This really helps simplify some things, thanks!

Could you rebase?

Member Author


:) perfect. I have rebased this against the latest master.

@andygrove
Member

andygrove commented Jul 22, 2020

This is looking good overall. I think we need to keep the Alias in the logical plan though.

@andygrove andygrove requested review from nevi-me and paddyhoran July 22, 2020 13:51
@jorgecarleitao
Member Author

jorgecarleitao commented Jul 23, 2020

The test csv_query_group_by_avg_with_projection is now failing (after rebase).

The query is SELECT avg(c12), c1 FROM aggregate_test_100 GROUP BY c1 and the error is that the order of the result is wrong: it is c1, avg(c12), not avg(c12), c1. I am looking into it.

EDIT 1: The root cause is that projection_expr is not passed to the LogicalPlanBuilder from the SQL planner, and thus there is no way for the LogicalPlanBuilder to know which order to apply when building the schema (it uses the order: groups first, aggregations second).
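
A small sketch of the fix implied here (hypothetical code, under the assumption stated above): since the aggregate emits group expressions first and aggregations second, a trailing projection is needed to map its output back to the order of the SELECT list.

```rust
// Hypothetical sketch of why a projection is needed after the aggregate.
fn main() {
    // The aggregate's output order: group expressions first, then aggregations.
    let aggregate_output = ["c1", "avg(c12)"];
    // The order requested by the SELECT list: avg(c12), c1.
    let select_order = ["avg(c12)", "c1"];
    // A trailing projection maps the aggregate's output back to the SELECT order.
    let projection: Vec<usize> = select_order
        .iter()
        .map(|name| aggregate_output.iter().position(|c| c == name).unwrap())
        .collect();
    assert_eq!(projection, vec![1, 0]);
}
```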

@jorgecarleitao
Member Author

@andygrove, this is IMO now ready. There were two errors, which I fixed in two separate commits:

  • 3503c0d fixed an error caused by a mistake of mine
  • ff3ade9 fixed an error in one of the tests

The first was related to the need to add a projection when the order of the SELECT is different from [groups] + [aggregates] in a GROUP BY. The current code has this, and I had removed that part by mistake in this PR. Also, since this was only being tested in the integration tests, I have added a new test to the library that validates the logical plan.

The second issue is IMO an error in one of the tests: it was parsing SELECT c2, MIN(c12), MAX(c12) FROM aggregate_test_100 GROUP BY c1 as a valid statement and interpreting c2 as representing c1. With this PR, that is identified as an error, because c2 is not a valid column name in this context.
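
A simplified, hypothetical sketch of the kind of check that now rejects that query. In this PR the rejection actually falls out of name-based column resolution rather than an explicit rule, but the effect is the same: a non-aggregated column in the SELECT list must appear in the GROUP BY clause, so c2 with GROUP BY c1 is an error.

```rust
// Hypothetical, simplified check; the real behavior comes from name resolution
// during logical planning, not from a function like this.
fn validate_group_by(non_aggregated: &[&str], group_by: &[&str]) -> Result<(), String> {
    for col in non_aggregated {
        if !group_by.contains(col) {
            return Err(format!("column '{}' is not in the GROUP BY clause", col));
        }
    }
    Ok(())
}

fn main() {
    // SELECT c2, MIN(c12), MAX(c12) FROM aggregate_test_100 GROUP BY c1
    // is rejected: the non-aggregated column c2 is not grouped by.
    assert!(validate_group_by(&["c2"], &["c1"]).is_err());
    // SELECT c1, MIN(c12), MAX(c12) ... GROUP BY c1 would be accepted.
    assert!(validate_group_by(&["c1"], &["c1"]).is_ok());
}
```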

@andygrove
Member

@jorgecarleitao Thanks. It looks like you will have to rebase one more time because master was rebased as part of the 1.0.0 release process. I can merge this once that is done.

@jorgecarleitao
Member Author

@andygrove done. Thanks a lot for the effort and help!

Member

@andygrove andygrove left a comment


LGTM!

@andygrove
Member

@jorgecarleitao The parquet version fix is merged, so one more rebase should get the tests passing.

Columns are no longer identified by their index, but by their name.

This is induced by the following assumption: every table that
we scan has unique column names.

This greatly simplifies the code and the public API of physical
and logical plans. It also greatly simplifies the projection
push-down and deprecates ResolveColumns, which is no longer needed.
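
A tiny sketch of the assumption stated above (hypothetical code, not part of this commit): column names within a scanned table must be unique, otherwise name-based resolution cannot disambiguate them.

```rust
use std::collections::HashSet;

// Hypothetical check for the uniqueness assumption.
fn has_unique_column_names(columns: &[&str]) -> bool {
    let mut seen = HashSet::new();
    columns.iter().all(|c| seen.insert(*c))
}

fn main() {
    assert!(has_unique_column_names(&["c1", "c2", "c3"]));
    // A duplicate name would make name-based lookup ambiguous.
    assert!(!has_unique_column_names(&["c1", "c1"]));
}
```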
@jorgecarleitao
Member Author

rebased
