-
Notifications
You must be signed in to change notification settings - Fork 1.9k
add an example of using DataFrame to create a subquery #5961
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jiangzhx -- I do not think the challenge of making subqueries is only felt by you -- the focus has been on handling subqueries from SQL I think rather than with DataFrame.
I have some ideas how to make this better. I'll see what I can do
| ctx.table("t1") | ||
| .await? | ||
| .filter( | ||
| Expr::ScalarSubquery(datafusion_expr::Subquery { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is 🤮 -- let me see if I can come up with some way to make this easier to construct
| ), | ||
| outer_ref_columns: vec![], | ||
| }) | ||
| .gt(lit(ScalarValue::UInt8(Some(0)))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| .gt(lit(ScalarValue::UInt8(Some(0)))), | |
| .gt(lit(0u8)), |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is one way we could simplify the example: jiangzhx#179 (a PR into this branch)
I do think there
ctx.table("t1")
.await?
.filter(
exists(Arc::new(
ctx.table("t2")
.await?
.filter(col("t1.c1").eq(col("t2.c1")))?
.aggregate(vec![], vec![avg(col("t2.c2"))])?
.select(vec![avg(col("t2.c2"))])?
.into_unoptimized_plan(),
))
.gt(lit(0u8)),By implementing some traits, I think we could remove the Arc and into_unoptimized_plan call
ctx.table("t1")
.await?
.filter(
exists( ctx.table("t2")
.await?
.filter(col("t1.c1").eq(col("t2.c1")))?
.aggregate(vec![], vec![avg(col("t2.c2"))])?
.select(vec![avg(col("t2.c2"))])?
))
.gt(lit(0u8)),with something like
pub fn scalar_subquery(subquery: impl IntoSubquery) -> Expr {
...
}
/// Something that can be converted into a plan suitable for a subquery
pub trait IntoSubquery {
fn into_subquery(self) -> Arc<LogicalPlan>
}
// and then implement IntoSubqury for `LogicalPlan`, `Arc<LogicalPlan>` and `DataFrame` Though maybe that is getting to complicated 🤔
Simplify expression examples
|
I think we can refine this example further but it is better than what we have at the moment |
|
Thanks again @jiangzhx |
* add an example of using DataFrame to create a subquery * Simplify expression examples --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
it is difficult to write the same logic based on the DataFrame API as SQL.
Therefore, I created this example hoping to help others.
The reason for the difficulty is,
Maybe it's just my personal reason.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?