Skip to content

Conversation

@doki23
Copy link
Contributor

@doki23 doki23 commented Mar 22, 2022

Which issue does this PR close?

Closes #2041 .

Rationale for this change

do some check before building a logical plan

@github-actions github-actions bot added datafusion sql SQL Planner labels Mar 22, 2022
@doki23 doki23 mentioned this pull request Mar 22, 2022
/// Build the plan
pub fn build(&self) -> Result<LogicalPlan> {
Ok(self.plan.clone())
if let Some(err) = check_plan_invalid(&self.plan) {
Copy link
Member

@jackwener jackwener Mar 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend to return Result instead of Option<Err>.

Then, We can handle error like this.

match err {
    Ok() => Ok(s),
    Err() => Err(e),
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean Result<LogicalPlan, DataFusionError>?
Hm, this check function just finds the err, if it returns Ok(plan), it's a little bit strange🤔.
Another reason is we must write return root_plan.clone() in each arm of the match if use Result.
If you mean Result<(), DataFusionError>, I think it's the same as Option, right?
Or if I fail to get your idea?

Copy link
Member

@jackwener jackwener Mar 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you mean Result<(), DataFusionError>, I think it's the same as Option, right?

Yes, Essentially they are all enumeration types

A result represent either success/ Ok or failure/ Err. It's more elegant and reasonable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can write this as well:

check_plan_invalid(&self.plan)?;

Which will return the error

}

/// do some checks for exprs in a logical plan
fn check_invalid_expr(expr: &Expr, schema: &DFSchemaRef) -> Option<DataFusionError> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

.iter()
.filter_map(|e| check_invalid_expr(e, schema))
.next()
}
Copy link
Contributor Author

@doki23 doki23 Mar 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, if use Result, code here looks like:

    for expr in exprs {
        let r = check_invalid_expr(expr, schema);
        if r.is_err() {
            return r;
        }
    }
    Ok(())

or

    exprs
        .iter()
        .map(|e| check_invalid_expr(e, schema))
        .fold_ok((), |start, v| start) // I dont like it

I dont like the fold_ok((), |start, v| start)
I don't know how to make it more elegant, please give more suggestions.
@jackwener @alamb

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about:

for expr in exprs { 
    check_invalid_expr(expr, schema)?
}
Ok(())

You can also use https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.try_for_each

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
I applied the suggested changes from @jackwener which is same as yours.

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checkout this PR locally and rewrite use Result.

It can compile successfully. You just apply those suggestion and continue working base on it.

Comment on lines 1097 to 1239
/// check whether the logical plan we are building is valid
fn check_plan_invalid(plan: &LogicalPlan) -> Option<DataFusionError> {
match plan {
LogicalPlan::Projection(Projection { expr, input, .. })
| LogicalPlan::Sort(Sort { expr, input })
| LogicalPlan::Window(Window {
window_expr: expr,
input,
..
}) => check_plan_invalid(input)
.or_else(|| check_any_invalid_expr(expr, input.schema())),

LogicalPlan::Filter(Filter {
predicate: expr,
input,
}) => {
check_plan_invalid(input).or_else(|| check_invalid_expr(expr, input.schema()))
}

LogicalPlan::Aggregate(Aggregate {
input,
group_expr,
aggr_expr,
..
}) => {
let schema = input.schema();
check_plan_invalid(input)
.or_else(|| check_any_invalid_expr(group_expr, schema))
.or_else(|| check_any_invalid_expr(aggr_expr, schema))
}

LogicalPlan::Join(Join { left, right, .. })
| LogicalPlan::CrossJoin(CrossJoin { left, right, .. }) => {
check_plan_invalid(left).or_else(|| check_plan_invalid(right))
}

LogicalPlan::Repartition(Repartition { input, .. })
| LogicalPlan::Limit(Limit { input, .. })
| LogicalPlan::Explain(Explain { plan: input, .. })
| LogicalPlan::Analyze(Analyze { input, .. }) => check_plan_invalid(input),

LogicalPlan::Union(Union { inputs, .. }) => {
inputs.iter().filter_map(check_plan_invalid).next()
}

LogicalPlan::TableScan(TableScan {
table_name: _,
source,
projection,
projected_schema,
filters,
limit: _,
}) => {
if let Some(projection) = projection {
if projection.len() > projected_schema.fields().len() {
return Some(DataFusionError::Plan(
"projection contains columns that doesnt belong to projected schema"
.to_owned(),
));
}
}
let schema = &source.schema().to_dfschema_ref().ok()?;
check_any_invalid_expr(filters, schema)
}

_ => None,
}
}

/// find first error in the exprs
fn check_any_invalid_expr(
exprs: &[Expr],
schema: &DFSchemaRef,
) -> Option<DataFusionError> {
exprs
.iter()
.filter_map(|e| check_invalid_expr(e, schema))
.next()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// check whether the logical plan we are building is valid
fn check_plan_invalid(plan: &LogicalPlan) -> Option<DataFusionError> {
match plan {
LogicalPlan::Projection(Projection { expr, input, .. })
| LogicalPlan::Sort(Sort { expr, input })
| LogicalPlan::Window(Window {
window_expr: expr,
input,
..
}) => check_plan_invalid(input)
.or_else(|| check_any_invalid_expr(expr, input.schema())),
LogicalPlan::Filter(Filter {
predicate: expr,
input,
}) => {
check_plan_invalid(input).or_else(|| check_invalid_expr(expr, input.schema()))
}
LogicalPlan::Aggregate(Aggregate {
input,
group_expr,
aggr_expr,
..
}) => {
let schema = input.schema();
check_plan_invalid(input)
.or_else(|| check_any_invalid_expr(group_expr, schema))
.or_else(|| check_any_invalid_expr(aggr_expr, schema))
}
LogicalPlan::Join(Join { left, right, .. })
| LogicalPlan::CrossJoin(CrossJoin { left, right, .. }) => {
check_plan_invalid(left).or_else(|| check_plan_invalid(right))
}
LogicalPlan::Repartition(Repartition { input, .. })
| LogicalPlan::Limit(Limit { input, .. })
| LogicalPlan::Explain(Explain { plan: input, .. })
| LogicalPlan::Analyze(Analyze { input, .. }) => check_plan_invalid(input),
LogicalPlan::Union(Union { inputs, .. }) => {
inputs.iter().filter_map(check_plan_invalid).next()
}
LogicalPlan::TableScan(TableScan {
table_name: _,
source,
projection,
projected_schema,
filters,
limit: _,
}) => {
if let Some(projection) = projection {
if projection.len() > projected_schema.fields().len() {
return Some(DataFusionError::Plan(
"projection contains columns that doesnt belong to projected schema"
.to_owned(),
));
}
}
let schema = &source.schema().to_dfschema_ref().ok()?;
check_any_invalid_expr(filters, schema)
}
_ => None,
}
}
/// find first error in the exprs
fn check_any_invalid_expr(
exprs: &[Expr],
schema: &DFSchemaRef,
) -> Option<DataFusionError> {
exprs
.iter()
.filter_map(|e| check_invalid_expr(e, schema))
.next()
}
/// check whether the logical plan we are building is valid
fn check_plan_invalid(plan: &LogicalPlan) -> Result<()> {
match plan {
LogicalPlan::Projection(Projection { expr, input, .. })
| LogicalPlan::Sort(Sort { expr, input })
| LogicalPlan::Window(Window {
window_expr: expr,
input,
..
}) => check_plan_invalid(input)
.or_else(|_| check_any_invalid_expr(expr, input.schema())),
LogicalPlan::Filter(Filter {
predicate: expr,
input,
}) => check_plan_invalid(input)
.or_else(|_| check_invalid_expr(expr, input.schema())),
LogicalPlan::Aggregate(Aggregate {
input,
group_expr,
aggr_expr,
..
}) => {
let schema = input.schema();
check_plan_invalid(input)
.or_else(|_| check_any_invalid_expr(group_expr, schema))
.or_else(|_| check_any_invalid_expr(aggr_expr, schema))
}
LogicalPlan::Join(Join { left, right, .. })
| LogicalPlan::CrossJoin(CrossJoin { left, right, .. }) => {
check_plan_invalid(left).or_else(|_| check_plan_invalid(right))
}
LogicalPlan::Repartition(Repartition { input, .. })
| LogicalPlan::Limit(Limit { input, .. })
| LogicalPlan::Explain(Explain { plan: input, .. })
| LogicalPlan::Analyze(Analyze { input, .. }) => check_plan_invalid(input),
LogicalPlan::Union(Union { inputs, .. }) => {
inputs.iter().map(check_plan_invalid).collect()
}
LogicalPlan::TableScan(TableScan {
table_name: _,
source,
projection,
projected_schema,
filters,
limit: _,
}) => {
if let Some(projection) = projection {
if projection.len() > projected_schema.fields().len() {
return Err(DataFusionError::Plan(
"projection contains columns that doesnt belong to projected schema"
.to_owned(),
));
}
}
let schema = &source.schema().to_dfschema_ref()?;
check_any_invalid_expr(filters, schema)
}
_ => Ok(()),
}
}

Copy link
Contributor Author

@doki23 doki23 Mar 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry for the late response! Thanks @jackwener , the code looks better!

@jackwener
Copy link
Member

Sorry for the reply a little late.

@doki23
Copy link
Contributor Author

doki23 commented Mar 24, 2022

Thanks for your detailed suggestions! @jackwener
I'll take a closer look in a moment.

@doki23 doki23 marked this pull request as ready for review March 26, 2022 03:29
@doki23
Copy link
Contributor Author

doki23 commented Mar 26, 2022

Thank you @jackwener . The code looks better!

@doki23 doki23 changed the title add some pre-check for LogicalPlanBuilder(WIP) add some pre-check for LogicalPlanBuilder Mar 27, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks -- I think this is looking good @doki23

The only thing I would really like to see before merging is a test for the wildcard check. The others comments are simply nice to have.

}

/// do some checks for exprs in a logical plan
fn check_invalid_expr(expr: &Expr, schema: &DFSchemaRef) -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I may have missed it, but as written I don't think this recurses into the expr -- so while it will find nonsense like

NOT "foo"

I don't think it would find

A = 5 OR (NOT "foo")

We could use an ExpressionVisitor for thishttps://github.com/apache/arrow-datafusion/blob/634252b44cec8dc904e48926d0aa54feeb4d48af/datafusion/src/logical_plan/expr_visitor.rs#L32-L47

Here is an example of using ExpressionVisitor for something similar:
https://github.com/apache/arrow-datafusion/blob/2d6addd4c435123e934ce04564d8cc77bf101c37/datafusion/src/datasource/listing/helpers.rs#L55-L119

Copy link
Contributor Author

@doki23 doki23 Mar 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that visitor pattern is better, I'll try that.

}
}
Ok(())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also please add a test for using wildcards incorrectly -- like SELECT .. GROUP BY * and SELECT .. GROUP BY foo*?


#[test]
fn select_neg_filter() {
let sql = "SELECT id, first_name, last_name \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

expand_wildcard(&qualifier_schema, plan)
}

/// check whether the logical plan we are building is valid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow on PR, it would be awesome to encapsulate this LogicalPlan tree walking code into some sort of visitor.

doki23 and others added 2 commits March 29, 2022 08:48
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@doki23 doki23 marked this pull request as draft March 29, 2022 04:13
@doki23
Copy link
Contributor Author

doki23 commented Mar 31, 2022

I add pre-check after optimization. Although there's no unified validation framework, but it seems that it's always valid after optimization😂. So this pr may be unnecessary. I'm going to close it temporarily if there're no other suggestions or counter examples.

@jackwener
Copy link
Member

😂

@doki23 doki23 closed this Apr 1, 2022
@doki23 doki23 deleted the wildcard_check branch March 24, 2023 03:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sql SQL Planner

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Do some pre-check when building LogicalPlan

4 participants