Skip to content

Conversation

@aprimadi
Copy link
Contributor

@aprimadi aprimadi commented May 22, 2023

Which issue does this PR close?

Closes #6363

Rationale for this change

N/A

What changes are included in this PR?

  • Modify expression simplifier to simplify large OR chains into IN LIST
  • Add assert_text_eq helper method to quickly debug problems in TPCH benchmark test
    The assert_text_eq helper method performs line diff algorithm similar to git that pinpoints exact difference.
  • Adding some tests

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the optimizer Optimizer rules label May 22, 2023
@github-actions github-actions bot added the core Core DataFusion crate label May 24, 2023
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label May 28, 2023
@aprimadi aprimadi marked this pull request as ready for review May 28, 2023 06:56
@alamb
Copy link
Contributor

alamb commented May 28, 2023

Looking forward to checking this one out @aprimadi -- thank you! I plan to review this tomorrow or Tuesday!

@jackwener
Copy link
Member

I just got home from my trip and I'm going to review the code more carefully tomorrow.
Thank you @aprimadi

@aprimadi
Copy link
Contributor Author

Btw, we can also simplify A != 1 AND A != 2 AND ... into A NOT IN (1, 2, ...). If this is desirable, I can create an issue and a follow up PR.

Not sure how often does that statement occur in practice but my guess is that it doesn't occur very often.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @aprimadi -- I think this looks great to me ❤️

I would like to reconsider adding fancier diffing to the tpch benchmark tests as explained below.

I also had some suggestions about avoiding clones in the rewrite that might be worth a look

cc @stuartcarnie as I think you may be interested in this one

| | Inner Join: lineitem.l_partkey = part.p_partkey Filter: part.p_brand = Utf8("Brand#12") AND part.p_container IN ([Utf8("SM CASE"), Utf8("SM BOX"), Utf8("SM PACK"), Utf8("SM PKG")]) AND lineitem.l_quantity >= Decimal128(Some(100),15,2) AND lineitem.l_quantity <= Decimal128(Some(1100),15,2) AND part.p_size <= Int32(5) OR part.p_brand = Utf8("Brand#23") AND part.p_container IN ([Utf8("MED BAG"), Utf8("MED BOX"), Utf8("MED PKG"), Utf8("MED PACK")]) AND lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND lineitem.l_quantity <= Decimal128(Some(2000),15,2) AND part.p_size <= Int32(10) OR part.p_brand = Utf8("Brand#34") AND part.p_container IN ([Utf8("LG CASE"), Utf8("LG BOX"), Utf8("LG PACK"), Utf8("LG PKG")]) AND lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND lineitem.l_quantity <= Decimal128(Some(3000),15,2) AND part.p_size <= Int32(15) |
| | Projection: lineitem.l_partkey, lineitem.l_quantity, lineitem.l_extendedprice, lineitem.l_discount |
| | Filter: (lineitem.l_quantity >= Decimal128(Some(100),15,2) AND lineitem.l_quantity <= Decimal128(Some(1100),15,2) OR lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND lineitem.l_quantity <= Decimal128(Some(2000),15,2) OR lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND lineitem.l_quantity <= Decimal128(Some(3000),15,2)) AND (lineitem.l_shipmode = Utf8("AIR REG") OR lineitem.l_shipmode = Utf8("AIR")) AND lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON") |
| | Filter: (lineitem.l_quantity >= Decimal128(Some(100),15,2) AND lineitem.l_quantity <= Decimal128(Some(1100),15,2) OR lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND lineitem.l_quantity <= Decimal128(Some(2000),15,2) OR lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND lineitem.l_quantity <= Decimal128(Some(3000),15,2)) AND (lineitem.l_shipmode = Utf8("AIR") OR lineitem.l_shipmode = Utf8("AIR REG")) AND lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON") |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these differences also appear to be that the order of some of the OR predicates has changed, which is fine as described above

Ok(())
}

struct Line(Option<usize>);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is a personal preference but if you want to introduce a new crate for comparing / diffing, I recommend adding https://insta.rs/

  1. It includes automatic updating (cargo insta review)
  2. It comes with its own diff generation tool
  3. We have used it with good luck in influxdb_iox

}
}

fn assert_text_eq(expected: String, actual: String) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I think it would help a lot to remove tests from the benchmarks -- @liurenjie1024 is making good progress in this area with #6435

Maybe we could avoid adding this extra comparison to the tpch binary and instead work on getting the tests out of the benchmark runner in the first place 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since @liurenjie1024 is working on removing these tests from the benchmark, perhaps we should just remove this? I keep this change contained in 1 commit so removing it should be as simple as reverting a commit.

The sqllogictests may benefit from adding fancier diff 🤔 but it's not a concern of this PR.

Comment on lines +74 to +76
let unboxed_left = *left.clone();
let unboxed_right = *right.clone();
match (&unboxed_left, &unboxed_right) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can avoid these clones by doing something like:

        Expr::BinaryExpr(BinaryExpr { left, op, right }) if *op == Operator::Eq => {
            match (left.as_ref(), right.as_ref()) {

I tried it out locally and I think the following diff results in fewer clones:

diff --git a/datafusion/optimizer/src/simplify_expressions/or_in_list_simplifier.rs b/datafusion/optimizer/src/simplify_expressions/or_in_list_simplifier.rs
index 10f3aa027..9192fbb77 100644
--- a/datafusion/optimizer/src/simplify_expressions/or_in_list_simplifier.rs
+++ b/datafusion/optimizer/src/simplify_expressions/or_in_list_simplifier.rs
@@ -17,6 +17,8 @@
 
 //! This module implements a rule that simplifies OR expressions into IN list expressions
 
+use std::borrow::Cow;
+
 use datafusion_common::tree_node::TreeNodeRewriter;
 use datafusion_common::Result;
 use datafusion_expr::expr::InList;
@@ -48,6 +50,8 @@ impl TreeNodeRewriter for OrInListSimplifier {
                         && !lhs.negated
                         && !rhs.negated
                     {
+                        let lhs = lhs.into_owned();
+                        let rhs = rhs.into_owned();
                         let mut list = vec![];
                         list.extend(lhs.list);
                         list.extend(rhs.list);
@@ -67,23 +71,21 @@ impl TreeNodeRewriter for OrInListSimplifier {
 }
 
 /// Try to convert an expression to an in-list expression
-fn as_inlist(expr: &Expr) -> Option<InList> {
+fn as_inlist(expr: &Expr) -> Option<Cow<InList>> {
     match expr {
-        Expr::InList(inlist) => Some(inlist.clone()),
+        Expr::InList(inlist) => Some(Cow::Borrowed(&inlist)),
         Expr::BinaryExpr(BinaryExpr { left, op, right }) if *op == Operator::Eq => {
-            let unboxed_left = *left.clone();
-            let unboxed_right = *right.clone();
-            match (&unboxed_left, &unboxed_right) {
-                (Expr::Column(_), Expr::Literal(_)) => Some(InList {
+            match (left.as_ref(), right.as_ref()) {
+                (Expr::Column(_), Expr::Literal(_)) => Some(Cow::Owned(InList {
                     expr: left.clone(),
-                    list: vec![unboxed_right],
+                    list: vec![*right.clone()],
                     negated: false,
-                }),
-                (Expr::Literal(_), Expr::Column(_)) => Some(InList {
+                })),
+                (Expr::Literal(_), Expr::Column(_)) => Some(Cow::Owned(InList {
                     expr: right.clone(),
-                    list: vec![unboxed_left],
+                    list: vec![*left.clone()],
                     negated: false,
-                }),
+                })),
                 _ => None,
             }
         }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alamb. I've been thinking about how to avoid those clones but since I'm kinda new to Rust I got stuck. Now I know.

@alamb
Copy link
Contributor

alamb commented May 31, 2023

I took the liberty of merging this PR up from main

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aprimadi and @alamb

@jackwener jackwener merged commit 2f264ab into apache:main Jun 1, 2023
@aprimadi
Copy link
Contributor Author

aprimadi commented Jun 1, 2023

Thanks @jackwener and @alamb for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rewrite large OR chains as IN lists

3 participants