-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Rewrite large OR chains as IN lists #6414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Looking forward to checking this one out @aprimadi -- thank you! I plan to review this tomorrow or Tuesday! |
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs
Outdated
Show resolved
Hide resolved
|
I just got home from my trip and I'm going to review the code more carefully tomorrow. |
|
Btw, we can also simplify Not sure how often does that statement occur in practice but my guess is that it doesn't occur very often. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @aprimadi -- I think this looks great to me ❤️
I would like to reconsider adding fancier diffing to the tpch benchmark tests as explained below.
I also had some suggestions about avoiding clones in the rewrite that might be worth a look
cc @stuartcarnie as I think you may be interested in this one
benchmarks/expected-plans/q19.txt
Outdated
| | | Inner Join: lineitem.l_partkey = part.p_partkey Filter: part.p_brand = Utf8("Brand#12") AND part.p_container IN ([Utf8("SM CASE"), Utf8("SM BOX"), Utf8("SM PACK"), Utf8("SM PKG")]) AND lineitem.l_quantity >= Decimal128(Some(100),15,2) AND lineitem.l_quantity <= Decimal128(Some(1100),15,2) AND part.p_size <= Int32(5) OR part.p_brand = Utf8("Brand#23") AND part.p_container IN ([Utf8("MED BAG"), Utf8("MED BOX"), Utf8("MED PKG"), Utf8("MED PACK")]) AND lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND lineitem.l_quantity <= Decimal128(Some(2000),15,2) AND part.p_size <= Int32(10) OR part.p_brand = Utf8("Brand#34") AND part.p_container IN ([Utf8("LG CASE"), Utf8("LG BOX"), Utf8("LG PACK"), Utf8("LG PKG")]) AND lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND lineitem.l_quantity <= Decimal128(Some(3000),15,2) AND part.p_size <= Int32(15) | | ||
| | | Projection: lineitem.l_partkey, lineitem.l_quantity, lineitem.l_extendedprice, lineitem.l_discount | | ||
| | | Filter: (lineitem.l_quantity >= Decimal128(Some(100),15,2) AND lineitem.l_quantity <= Decimal128(Some(1100),15,2) OR lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND lineitem.l_quantity <= Decimal128(Some(2000),15,2) OR lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND lineitem.l_quantity <= Decimal128(Some(3000),15,2)) AND (lineitem.l_shipmode = Utf8("AIR REG") OR lineitem.l_shipmode = Utf8("AIR")) AND lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON") | | ||
| | | Filter: (lineitem.l_quantity >= Decimal128(Some(100),15,2) AND lineitem.l_quantity <= Decimal128(Some(1100),15,2) OR lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND lineitem.l_quantity <= Decimal128(Some(2000),15,2) OR lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND lineitem.l_quantity <= Decimal128(Some(3000),15,2)) AND (lineitem.l_shipmode = Utf8("AIR") OR lineitem.l_shipmode = Utf8("AIR REG")) AND lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON") | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these differences also appear to be that the order of some of the OR predicates has changed, which is fine as described above
benchmarks/src/bin/tpch.rs
Outdated
| Ok(()) | ||
| } | ||
|
|
||
| struct Line(Option<usize>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is a personal preference but if you want to introduce a new crate for comparing / diffing, I recommend adding https://insta.rs/
- It includes automatic updating (
cargo insta review) - It comes with its own diff generation tool
- We have used it with good luck in influxdb_iox
benchmarks/src/bin/tpch.rs
Outdated
| } | ||
| } | ||
|
|
||
| fn assert_text_eq(expected: String, actual: String) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I think it would help a lot to remove tests from the benchmarks -- @liurenjie1024 is making good progress in this area with #6435
Maybe we could avoid adding this extra comparison to the tpch binary and instead work on getting the tests out of the benchmark runner in the first place 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since @liurenjie1024 is working on removing these tests from the benchmark, perhaps we should just remove this? I keep this change contained in 1 commit so removing it should be as simple as reverting a commit.
The sqllogictests may benefit from adding fancier diff 🤔 but it's not a concern of this PR.
| let unboxed_left = *left.clone(); | ||
| let unboxed_right = *right.clone(); | ||
| match (&unboxed_left, &unboxed_right) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can avoid these clones by doing something like:
Expr::BinaryExpr(BinaryExpr { left, op, right }) if *op == Operator::Eq => {
match (left.as_ref(), right.as_ref()) {
I tried it out locally and I think the following diff results in fewer clones:
diff --git a/datafusion/optimizer/src/simplify_expressions/or_in_list_simplifier.rs b/datafusion/optimizer/src/simplify_expressions/or_in_list_simplifier.rs
index 10f3aa027..9192fbb77 100644
--- a/datafusion/optimizer/src/simplify_expressions/or_in_list_simplifier.rs
+++ b/datafusion/optimizer/src/simplify_expressions/or_in_list_simplifier.rs
@@ -17,6 +17,8 @@
//! This module implements a rule that simplifies OR expressions into IN list expressions
+use std::borrow::Cow;
+
use datafusion_common::tree_node::TreeNodeRewriter;
use datafusion_common::Result;
use datafusion_expr::expr::InList;
@@ -48,6 +50,8 @@ impl TreeNodeRewriter for OrInListSimplifier {
&& !lhs.negated
&& !rhs.negated
{
+ let lhs = lhs.into_owned();
+ let rhs = rhs.into_owned();
let mut list = vec![];
list.extend(lhs.list);
list.extend(rhs.list);
@@ -67,23 +71,21 @@ impl TreeNodeRewriter for OrInListSimplifier {
}
/// Try to convert an expression to an in-list expression
-fn as_inlist(expr: &Expr) -> Option<InList> {
+fn as_inlist(expr: &Expr) -> Option<Cow<InList>> {
match expr {
- Expr::InList(inlist) => Some(inlist.clone()),
+ Expr::InList(inlist) => Some(Cow::Borrowed(&inlist)),
Expr::BinaryExpr(BinaryExpr { left, op, right }) if *op == Operator::Eq => {
- let unboxed_left = *left.clone();
- let unboxed_right = *right.clone();
- match (&unboxed_left, &unboxed_right) {
- (Expr::Column(_), Expr::Literal(_)) => Some(InList {
+ match (left.as_ref(), right.as_ref()) {
+ (Expr::Column(_), Expr::Literal(_)) => Some(Cow::Owned(InList {
expr: left.clone(),
- list: vec![unboxed_right],
+ list: vec![*right.clone()],
negated: false,
- }),
- (Expr::Literal(_), Expr::Column(_)) => Some(InList {
+ })),
+ (Expr::Literal(_), Expr::Column(_)) => Some(Cow::Owned(InList {
expr: right.clone(),
- list: vec![unboxed_left],
+ list: vec![*left.clone()],
negated: false,
- }),
+ })),
_ => None,
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alamb. I've been thinking about how to avoid those clones but since I'm kinda new to Rust I got stuck. Now I know.
# Conflicts: # benchmarks/expected-plans/q12.txt # benchmarks/expected-plans/q19.txt # benchmarks/src/bin/tpch.rs # datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs
|
I took the liberty of merging this PR up from main |
jackwener
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Thanks @jackwener and @alamb for the review. |
Which issue does this PR close?
Closes #6363
Rationale for this change
N/A
What changes are included in this PR?
The assert_text_eq helper method performs line diff algorithm similar to git that pinpoints exact difference.
Are these changes tested?
Yes
Are there any user-facing changes?
No