add an example of using DataFrame to create a subquery #5961

jiangzhx · 2023-04-11T10:33:14Z

Which issue does this PR close?

It is difficult to express the same logic with the DataFrame API as with SQL.
Therefore, I created this example, hoping it will help others.

The difficulty stems from the following issues and discussions:

SQL case scalar_subquery logical_paln unexpected Aggregate: groupBy=[[col]] #5791 (comment)
SQL case, This feature is not implemented: Physical plan does not support logical expression EXISTS (<subquery>) #5789
SQL case, the subquery plan inside the sub query expression does not get a chance to run all the optimizer rules #5771 (comment)
remove duplicate the logic b/w DataFrame API and SQL planning #5686 (comment)

Maybe this is only a problem for me.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb

Thank you @jiangzhx -- I do not think the challenge of making subqueries is only felt by you. I think the focus so far has been on handling subqueries from SQL rather than from the DataFrame API.

I have some ideas for how to make this better. I'll see what I can do.

alamb · 2023-04-11T21:10:14Z

datafusion-examples/examples/dataframe_subquery.rs

+    ctx.table("t1")
+        .await?
+        .filter(
+            Expr::ScalarSubquery(datafusion_expr::Subquery {


I agree this is 🤮 -- let me see if I can come up with some way to make this easier to construct

alamb · 2023-04-11T21:11:42Z

datafusion-examples/examples/dataframe_subquery.rs

+                ),
+                outer_ref_columns: vec![],
+            })
+            .gt(lit(ScalarValue::UInt8(Some(0)))),


Suggested change

-            .gt(lit(ScalarValue::UInt8(Some(0)))),
+            .gt(lit(0u8)),
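
The shorter form works because `lit` is generic over a conversion trait, so a plain Rust integer can stand in for an explicit `ScalarValue`. The following is only a minimal, self-contained sketch of that pattern: the `ScalarValue`, `Expr`, and `Literal` definitions here are simplified stand-ins written for illustration, not the real DataFusion types.

```rust
// Stand-in types for illustration only; these are NOT the real DataFusion definitions.
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    UInt8(Option<u8>),
    Int64(Option<i64>),
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(ScalarValue),
}

/// Anything that can be turned into a literal expression.
trait Literal {
    fn lit(&self) -> Expr;
}

// An explicit scalar value is wrapped as-is.
impl Literal for ScalarValue {
    fn lit(&self) -> Expr {
        Expr::Literal(self.clone())
    }
}

// A plain `u8` picks the matching scalar variant automatically.
impl Literal for u8 {
    fn lit(&self) -> Expr {
        Expr::Literal(ScalarValue::UInt8(Some(*self)))
    }
}

// A plain `i64` maps to a different variant, so the Rust type still matters.
impl Literal for i64 {
    fn lit(&self) -> Expr {
        Expr::Literal(ScalarValue::Int64(Some(*self)))
    }
}

/// Generic entry point: callers write `lit(0u8)` instead of spelling out the scalar.
fn lit<T: Literal>(value: T) -> Expr {
    value.lit()
}
```

Under these assumptions, `lit(0u8)` and `lit(ScalarValue::UInt8(Some(0)))` build the same expression, while `lit(0i64)` builds a different one, which is why the type suffix on the literal is kept in the suggestion.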

alamb

Here is one way we could simplify the example: jiangzhx#179 (a PR into this branch)

I do think there

    ctx.table("t1")
        .await?
        .filter(
            // scalar subquery: avg(t2.c2) over the t2 rows matching t1.c1
            scalar_subquery(Arc::new(
                ctx.table("t2")
                    .await?
                    .filter(col("t1.c1").eq(col("t2.c1")))?
                    .aggregate(vec![], vec![avg(col("t2.c2"))])?
                    .select(vec![avg(col("t2.c2"))])?
                    .into_unoptimized_plan(),
            ))
            .gt(lit(0u8)),
        )?

By implementing some traits, I think we could remove the Arc and the into_unoptimized_plan call:

    ctx.table("t1")
        .await?
        .filter(
            // the DataFrame is passed directly, with no Arc or plan conversion
            scalar_subquery(
                ctx.table("t2")
                    .await?
                    .filter(col("t1.c1").eq(col("t2.c1")))?
                    .aggregate(vec![], vec![avg(col("t2.c2"))])?
                    .select(vec![avg(col("t2.c2"))])?,
            )
            .gt(lit(0u8)),
        )?

with something like

pub fn scalar_subquery(subquery: impl IntoSubquery) -> Expr {
...
}

/// Something that can be converted into a plan suitable for a subquery
pub trait IntoSubquery {
    fn into_subquery(self) -> Arc<LogicalPlan>;
}

// and then implement IntoSubquery for `LogicalPlan`, `Arc<LogicalPlan>` and `DataFrame`

Though maybe that is getting too complicated 🤔
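
To make the trait idea concrete, here is a minimal, self-contained sketch of how the three implementations might look. `LogicalPlan`, `DataFrame`, and `Expr` below are hypothetical stand-ins defined only for this illustration (the real DataFusion types are far richer), and `IntoSubquery` is the proposed trait from this comment, not an existing API.

```rust
use std::sync::Arc;

// Hypothetical stand-ins for illustration; NOT the real DataFusion types.
#[derive(Debug, PartialEq)]
struct LogicalPlan {
    description: String,
}

struct DataFrame {
    plan: LogicalPlan,
}

impl DataFrame {
    // Stand-in mirroring the name of `DataFrame::into_unoptimized_plan`.
    fn into_unoptimized_plan(self) -> LogicalPlan {
        self.plan
    }
}

#[derive(Debug, PartialEq)]
enum Expr {
    ScalarSubquery(Arc<LogicalPlan>),
}

/// Something that can be converted into a plan suitable for a subquery.
trait IntoSubquery {
    fn into_subquery(self) -> Arc<LogicalPlan>;
}

// A bare plan only needs to be wrapped in an `Arc`.
impl IntoSubquery for LogicalPlan {
    fn into_subquery(self) -> Arc<LogicalPlan> {
        Arc::new(self)
    }
}

// An already shared plan is passed through unchanged.
impl IntoSubquery for Arc<LogicalPlan> {
    fn into_subquery(self) -> Arc<LogicalPlan> {
        self
    }
}

// A DataFrame is converted to its plan first, hiding that call from the caller.
impl IntoSubquery for DataFrame {
    fn into_subquery(self) -> Arc<LogicalPlan> {
        Arc::new(self.into_unoptimized_plan())
    }
}

/// One entry point accepting any of the three input kinds.
fn scalar_subquery(subquery: impl IntoSubquery) -> Expr {
    Expr::ScalarSubquery(subquery.into_subquery())
}
```

The trade-off in this sketch is one extra trait in exchange for call sites that no longer mention `Arc` or the plan conversion; whether that is worth the added API surface is the open question raised above.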

Simplify expression examples

alamb · 2023-04-12T18:16:31Z

I think we can refine this example further but it is better than what we have at the moment

alamb · 2023-04-12T18:16:38Z

Thanks again @jiangzhx

* add an example of using DataFrame to create a subquery
* Simplify expression examples

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

add an example of using DataFrame to create a subquery

2a21c63

alamb approved these changes Apr 11, 2023

Simplify expression examples

8bb1a7a

alamb mentioned this pull request Apr 11, 2023

Simplify expression examples jiangzhx/arrow-datafusion#179

Merged

alamb reviewed Apr 11, 2023

Merge pull request #179 from alamb/alamb/siimpler_example

3977662

Simplify expression examples

alamb merged commit 0e5f6df into apache:main Apr 12, 2023

alamb mentioned this pull request Jan 30, 2025

SQL case scalar_subquery logical_paln unexpected Aggregate: groupBy=[[col]] #5791

Closed
