ARROW-9516: [Rust][DataFusion] refactor of column names #7796
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format? See also:
This change was necessary because the file's schema is c1, c2, not state, salary. We were able to get away with this because, since the names did not mean anything, we could read a file with a given schema using another schema's field names, as long as the indexing was correct. I believe that this should not be possible by design, as it introduces situations that are non-trivial to debug.
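To illustrate the concern, here is a minimal std-only sketch, not the Arrow or DataFusion API. The `Schema` type and both `resolve_*` methods are hypothetical names introduced for illustration. It shows how an index-based lookup quietly accepts a mismatched schema, while a name-based lookup reports the mismatch.

```rust
// Hypothetical, std-only model; these types are NOT the Arrow or DataFusion API.

/// A schema reduced to an ordered list of column names.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
}

impl Schema {
    pub fn new(names: &[&str]) -> Self {
        Schema {
            fields: names.iter().map(|n| n.to_string()).collect(),
        }
    }

    /// Index-based lookup: succeeds whenever the position exists,
    /// regardless of what the column is called.
    pub fn resolve_by_index(&self, index: usize) -> Option<&str> {
        self.fields.get(index).map(|s| s.as_str())
    }

    /// Name-based lookup: returns an error when no column has that name.
    pub fn resolve_by_name(&self, name: &str) -> Result<usize, String> {
        self.fields
            .iter()
            .position(|f| f == name)
            .ok_or_else(|| format!("no column named '{}' in schema {:?}", name, self.fields))
    }
}
```

With a file schema of `c1, c2` and a declared schema of `state, salary`, resolving `salary` to index 1 and reading position 1 of the file returns `c2` without complaint, whereas looking up `salary` by name in the file schema fails immediately.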
rust/datafusion/src/logicalplan.rs
This was already the case - I am just making it explicit in this comment.
@andygrove fyi
It looks like we're losing the alias name here?
So we just need to fix this so that the physical expression is created using the aliased name.
No, for the same reason: the physical plan does not care about names anymore, and aliases are only about naming.
fyi my comment was written before I read your second comment;
these pages are definitely not thread safe!
Ok, so the schema for these physical operators is based on the logical schema? I guess I just want to be sure that the alias name is preserved in the query results.
A unit test demonstrating this would be good.
(I wrote my other comment because I became uncertain about my claim and wanted to double check ^_^: yes, good to have this triple checked; this is a big change)
Let's see:
- the schema of the logical plan is built on the column's names as per this line, which is then passed to the physical plan as per line of this discussion. So, it should be the case...
- line 1089 of context.rs IMO demonstrates this: we alias an expression in the logical plan, and the field for that column in the physical plan's schema is the alias itself.
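The reasoning in the two points above can be sketched roughly as follows. This is a std-only illustration under stated assumptions, not DataFusion's actual code: the `Expr` enum, `logical_schema`, and `physical_schema` are hypothetical names, and a schema is reduced to a list of field names.

```rust
// Hypothetical, std-only model; NOT the DataFusion API.

/// A tiny logical expression: a column reference or an aliased expression.
#[derive(Debug, Clone)]
pub enum Expr {
    Column(String),
    Alias(Box<Expr>, String),
}

impl Expr {
    /// The name this expression contributes to the output schema.
    /// An alias overrides whatever name the inner expression had.
    pub fn name(&self) -> String {
        match self {
            Expr::Column(name) => name.clone(),
            Expr::Alias(_, alias) => alias.clone(),
        }
    }
}

/// The logical schema is built from the names of the projected expressions.
pub fn logical_schema(exprs: &[Expr]) -> Vec<String> {
    exprs.iter().map(|e| e.name()).collect()
}

/// Assumption for this sketch: the physical operator receives the logical
/// schema unchanged, so it never needs to know about aliases itself.
pub fn physical_schema(logical: &[String]) -> Vec<String> {
    logical.to_vec()
}
```

Under these assumptions, projecting `c1` and `c2 AS total` gives a physical schema of `c1, total`, which is the property the requested unit test would check.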
Thanks. I just pulled your branch and modified one of the integration tests just to be sure, and it looks good. This really helps simplify some things, thanks!
Could you rebase?
:) perfect. I have rebased this against the latest master.
The test The query is EDIT 1: The root cause is that
@andygrove , this is IMO now ready. There were two errors that I fixed in two separate commits:
The first one was related to the need to add a projection when the order of SELECT is different from [groups] + [aggregates] in a group by. The current code has this, and I had removed that part by mistake in this PR. Also, since this was only being tested in the integration tests, I have added a new test on the library to validate the logical plan.
The second issue is IMO an error in one of the tests: it was parsing
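As a rough illustration of the first fix, the sketch below models an aggregate whose output is always [groups] + [aggregates] and decides whether an extra projection is needed to restore the SELECT order. It is std-only, and `aggregate_output` and `projection_indices` are hypothetical helpers, not functions from DataFusion.

```rust
// Hypothetical, std-only model; NOT the DataFusion API.

/// Columns produced by an aggregate: group expressions first, then aggregates.
pub fn aggregate_output(groups: &[&str], aggregates: &[&str]) -> Vec<String> {
    groups
        .iter()
        .chain(aggregates.iter())
        .map(|s| s.to_string())
        .collect()
}

/// For each SELECT item, find the position of that name in the aggregate output.
/// Returns `Ok(None)` when the SELECT order already matches the output, so no
/// projection is required, and `Ok(Some(indices))` when a reordering is needed.
pub fn projection_indices(
    select: &[&str],
    output: &[String],
) -> Result<Option<Vec<usize>>, String> {
    let mut indices = Vec::with_capacity(select.len());
    for name in select {
        let idx = output
            .iter()
            .position(|o| o.as_str() == *name)
            .ok_or_else(|| format!("'{}' is not in the aggregate output", name))?;
        indices.push(idx);
    }
    let identity = indices.len() == output.len()
        && indices.iter().enumerate().all(|(i, &idx)| i == idx);
    Ok(if identity { None } else { Some(indices) })
}
```

For a hypothetical query that selects `MIN(c2), c1` while grouping by `c1`, the aggregate emits `c1, MIN(c2)`, so this sketch reports that a projection swapping the two columns is required.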
@jorgecarleitao Thanks. It looks like you will have to rebase one more time because master was rebased as part of the 1.0.0 release process. I can merge this once that is done.
@andygrove done. Thanks a lot for the effort and help!
andygrove left a comment
LGTM!
@jorgecarleitao The parquet version fix is merged, so one more rebase should get the tests passing.
Columns are no longer identified by their index, but by their name. This relies on the following assumption: every table that we scan has unique column names. This greatly simplifies the code and the public API of physical plans and logical plans. It also greatly simplifies the projection push down, and deprecates ResolveColumns.
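One way to see why name-based columns simplify projection push down is the std-only sketch below. It is an assumption-laden illustration, not the optimizer in this PR: `Expr`, `collect_columns`, and `pushed_down_projection` are hypothetical names, and column names are assumed unique within the scanned table.

```rust
use std::collections::BTreeSet;

// Hypothetical, std-only model; NOT the DataFusion API.

#[derive(Debug, Clone)]
pub enum Expr {
    Column(String),
    Literal(i64),
    Binary(Box<Expr>, Box<Expr>),
}

/// Walk an expression tree and record every column name it references.
pub fn collect_columns(expr: &Expr, acc: &mut BTreeSet<String>) {
    match expr {
        Expr::Column(name) => {
            acc.insert(name.clone());
        }
        Expr::Literal(_) => {}
        Expr::Binary(left, right) => {
            collect_columns(left, acc);
            collect_columns(right, acc);
        }
    }
}

/// Keep only the table columns that some expression uses, in table order.
/// Because columns are referenced by name, pruning the scan does not require
/// rewriting any index inside the expressions.
pub fn pushed_down_projection(table: &[&str], exprs: &[Expr]) -> Vec<String> {
    let mut used = BTreeSet::new();
    for expr in exprs {
        collect_columns(expr, &mut used);
    }
    table
        .iter()
        .filter(|c| used.contains(**c))
        .map(|c| c.to_string())
        .collect()
}
```

With index-based columns, dropping `c2` from a scan of `c1, c2, c3` would shift the index of `c3` and force every expression to be re-resolved; with names, the expressions stay valid as written.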
This is no longer needed.
rebased
This PR addresses ARROW-9516.
It:
SUM(a), SUM(b) instead of SUM, SUM.
This is currently a proof of value: all tests pass, but no decision has been made on whether we should proceed with these changes. More details are available at ARROW-9516.