ARROW-11879 [Rust][DataFusion] Make ExecutionContext::sql return dataframe with optimized plan #9639
    let opt_plan1 = ctx.optimize(&plan1)?;
    let plan2 = ctx.sql("SELECT * FROM (SELECT 1) WHERE TRUE AND TRUE")?;
Before the PR the test fails, as it doesn't optimize the plan (an optimized plan just returns the same as a plan for SELECT 1).
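To illustrate what the test checks, here is a minimal, self-contained sketch. It uses toy stand-ins rather than the real DataFusion types: `LogicalPlan`, `DataFrame`, and `ExecutionContext` below are simplified models, the parser recognizes only the two queries involved, and the single rewrite rule (dropping an always-true filter) is an assumption made for illustration.

```rust
// Toy model only: these types mimic the names used in the PR but are
// NOT DataFusion's real definitions.
#[derive(Debug, Clone, PartialEq)]
enum LogicalPlan {
    SelectOne,
    Filter { always_true: bool, input: Box<LogicalPlan> },
}

struct DataFrame {
    plan: LogicalPlan,
}

impl DataFrame {
    fn to_logical_plan(&self) -> LogicalPlan {
        self.plan.clone()
    }
}

struct ExecutionContext;

impl ExecutionContext {
    // Hypothetical parser: handles only the two queries used in this sketch.
    fn create_logical_plan(&self, sql: &str) -> Result<LogicalPlan, String> {
        match sql {
            "SELECT 1" => Ok(LogicalPlan::SelectOne),
            "SELECT * FROM (SELECT 1) WHERE TRUE AND TRUE" => Ok(LogicalPlan::Filter {
                always_true: true,
                input: Box::new(LogicalPlan::SelectOne),
            }),
            other => Err(format!("unsupported query in this sketch: {}", other)),
        }
    }

    // Assumed rewrite rule: a filter whose predicate is always true is removed.
    fn optimize(&self, plan: &LogicalPlan) -> Result<LogicalPlan, String> {
        Ok(match plan {
            LogicalPlan::SelectOne => LogicalPlan::SelectOne,
            LogicalPlan::Filter { always_true: true, input } => self.optimize(input)?,
            LogicalPlan::Filter { always_true: false, input } => LogicalPlan::Filter {
                always_true: false,
                input: Box::new(self.optimize(input)?),
            },
        })
    }

    // Behaviour proposed by the PR: `sql` optimizes before building the DataFrame.
    fn sql(&self, sql: &str) -> Result<DataFrame, String> {
        let plan = self.create_logical_plan(sql)?;
        let plan = self.optimize(&plan)?;
        Ok(DataFrame { plan })
    }
}

fn main() -> Result<(), String> {
    let ctx = ExecutionContext;
    let plan1 = ctx.create_logical_plan("SELECT 1")?;
    let opt_plan1 = ctx.optimize(&plan1)?;
    let plan2 = ctx.sql("SELECT * FROM (SELECT 1) WHERE TRUE AND TRUE")?;
    // Holds only because `sql` already applied the optimizer.
    assert_eq!(opt_plan1, plan2.to_logical_plan());
    Ok(())
}
```

If `sql` skipped the `optimize` call in this model, `plan2` would still contain the `Filter` node and the assertion would fail, which mirrors the failure described above.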
👍
jorgecarleitao
left a comment
LGTM. Well spotted. Thanks @Dandandan !
alamb
left a comment
I agree -- nice catch @Dandandan. However, there appears to be a failure in one of the tests on this PR.
hm it seems it's slightly more complicated
keeping as a draft for now, I think it's more open for discussion maybe what to do here. Do we want the dataframe from
Ideally in my mind we would be able to run the optimizations twice (so we could do it with the initial call to

@Dandandan something I have been thinking recently (as I prepared for my talk next week on DataFusion as well as talking with @NGA-TRAN on my team at Influx) was how similar the

I almost wonder if we should combine the two somehow... I don't have a concrete proposal now just 🤔
ARROW-11881: [Rust][DataFusion] Fix clippy lint

A linter error has appeared on master somehow:

```
error: unnecessary parentheses around `for` iterator expression
   --> datafusion/src/physical_plan/merge.rs:124:31
    |
124 |                 for part_i in (0..input_partitions) {
    |                               ^^^^^^^^^^^^^^^^^^^^^ help: remove these parentheses
    |
    = note: `-D unused-parens` implied by `-D warnings`
```

Seen on at least #9612 and #9639:
https://github.com/apache/arrow/pull/9612/checks?check_run_id=2042047472
https://github.com/apache/arrow/pull/9639/checks?check_run_id=2042649120

Closes #9642 from alamb/fix_clippy

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
I removed the check and added a test for the projection pushdown verifying that it returns the same plan when optimizing twice. I am not sure what the check was trying to prevent. All the tests pass (and they use sql + collect quite often).
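The "same plan when optimizing twice" property can be sketched on a toy model. Everything below is hypothetical: the `Plan` enum and `push_down_projection` are simplified stand-ins written for this example, not DataFusion's projection push-down rule.

```rust
// Toy model only: a simplified plan and push-down rule, NOT DataFusion's.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { table: String, columns: Vec<String> },
    Projection { columns: Vec<String>, input: Box<Plan> },
}

// Narrow a scan to the columns that the projection directly above it needs.
fn push_down_projection(plan: &Plan) -> Plan {
    match plan {
        Plan::Scan { .. } => plan.clone(),
        Plan::Projection { columns, input } => match input.as_ref() {
            Plan::Scan { table, columns: available } => {
                let kept: Vec<String> = available
                    .iter()
                    .filter(|c| columns.contains(*c))
                    .cloned()
                    .collect();
                Plan::Projection {
                    columns: columns.clone(),
                    input: Box::new(Plan::Scan { table: table.clone(), columns: kept }),
                }
            }
            other => Plan::Projection {
                columns: columns.clone(),
                input: Box::new(push_down_projection(other)),
            },
        },
    }
}

// Hypothetical example: project column `a` out of a three-column scan.
fn example_plan() -> Plan {
    Plan::Projection {
        columns: vec!["a".to_string()],
        input: Box::new(Plan::Scan {
            table: "t".to_string(),
            columns: vec!["a".to_string(), "b".to_string(), "c".to_string()],
        }),
    }
}

fn main() {
    let once = push_down_projection(&example_plan());
    let twice = push_down_projection(&once);
    // Running the rule on its own output must not change the plan.
    assert_eq!(once, twice);
}
```

In this model the second pass finds the scan already narrowed, so it returns an identical plan. That is the property that makes running the optimizer in both `sql` and `collect` harmless apart from the repeated work.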
Thanks. Yeah

For example the public function

But this PR now runs the optimizer twice if you use
45f2800 to
42420b3
alamb
left a comment
I think this looks good to me. Thanks again @Dandandan
I also re-ran the DataFusion tests locally on this branch after merging from master to make sure all still looks good. 👍
I believe we should expect `ExecutionContext::sql` to return a `DataFrame` with an optimized logical plan (using the currently applied config) rather than a `DataFrame` with an unoptimized plan.

I believe so because the `repl` and the docs use `ExecutionContext::sql`.

The TPC-H benchmarks don't use `ExecutionContext::sql`, which I guess is why it was missed before.

FYI @alamb @andygrove