[enhance](orc) Optimize ORC Predicate Pushdown for OR-connected Predicate #43255
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website.
-bool OrcReader::_init_search_argument(
-        std::unordered_map<std::string, ColumnValueRangeType>* colname_to_value_range) {
-    if ((!_enable_filter_by_min_max) || colname_to_value_range->empty()) {
+bool OrcReader::_init_search_argument(const VExprContextSPtrs& conjuncts) {
For a given query, the conjuncts are identical across all ORC readers.
So I think we can initialize the search argument just once and reuse the result for every ORC reader?
It is possible to cache the searchArgument in a member variable of OrcReader.
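The reuse idea discussed in these two comments can be sketched roughly as follows. This is a minimal illustration, not code from the PR: `SearchArg`, `build_search_arg`, `build_count`, and `Reader` are hypothetical stand-ins, and the real `OrcReader` and ORC search argument types are not modelled. The assumption is simply that the built result is immutable, so one instance can be shared by every reader of the same query.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a built, immutable search argument.
struct SearchArg {
    std::string description;
};

// Hypothetical counter recording how many times the build actually ran.
inline int& build_count() {
    static int n = 0;
    return n;
}

// Hypothetical builder: pretend this is the expensive conversion from conjuncts.
std::shared_ptr<const SearchArg> build_search_arg(const std::string& conjuncts) {
    ++build_count();
    return std::make_shared<SearchArg>(SearchArg{"sarg(" + conjuncts + ")"});
}

// Hypothetical reader that keeps the shared result as a member instead of rebuilding it.
class Reader {
public:
    explicit Reader(std::shared_ptr<const SearchArg> sarg) : _sarg(std::move(sarg)) {}
    const SearchArg& search_argument() const { return *_sarg; }

private:
    std::shared_ptr<const SearchArg> _sarg;
};
```

Building once per query and handing the same `std::shared_ptr<const SearchArg>` to each reader keeps the per-reader cost to a pointer copy. Whether the real search argument object can be shared this way depends on the ORC API's ownership rules, which this sketch does not model.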
### What problem does this PR solve?

### Release note

Related PR: #43255

Improved ACID table column handling:
- Added support for ACID column prefix in ORC column initialization
- Fixed column name handling for ACID tables
- Improved type mapping for ACID table columns
### What problem does this PR solve?

Related PR: apache#43255

Problem Summary:

Example:

```sql
CREATE TABLE table_a (
    id INT,
    age INT
) STORED AS ORC;

INSERT INTO table_a VALUES (1, null), (2, 18), (3, null), (4, 25);

CREATE TABLE table_b (
    id INT,
    age INT
) STORED AS ORC;

INSERT INTO table_b VALUES (1, null), (2, null), (3, 1000000), (4, 100);
```

Run the following SQL:

```sql
select * from table_a inner join table_b on table_a.age <=> table_b.age and table_b.id in (1,3);
```

When this SQL is executed, the backend generates a runtime filter on the table_a side during the join, producing a condition like `WHERE table_a.age IN (NULL, 1000000)`. Because `<=>` is a null-aware comparison operator, the IN predicate must also be null-aware. However, the ORC predicate pushdown API does not support null-aware IN predicates, so the current approach ignores null values and the query returns an empty result set. To fix this bug, predicates with null-aware comparisons are no longer pushed down, which gives the correct result:

```text
+------+------+------+------+
| id   | age  | id   | age  |
+------+------+------+------+
|    1 | NULL |    1 | NULL |
|    3 | NULL |    1 | NULL |
+------+------+------+------+
```
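To make the failure mode concrete, here is a small sketch that replays the runtime filter `IN (NULL, 1000000)` over the `age` values of table_a. It models only the two semantics involved and does not use the Doris or ORC APIs. `NullableInt`, `null_safe_equal`, `null_aware_in`, `in_ignoring_nulls`, and `matching_ids` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical nullable integer; `value` is meaningless when `is_null` is true.
struct NullableInt {
    bool is_null;
    int value;
};

// Null-safe equality, the semantics of `<=>`: NULL matches NULL.
bool null_safe_equal(const NullableInt& a, const NullableInt& b) {
    if (a.is_null || b.is_null) return a.is_null && b.is_null;
    return a.value == b.value;
}

// IN that honours NULL entries in the list, as the runtime filter requires.
bool null_aware_in(const NullableInt& v, const std::vector<NullableInt>& list) {
    for (const auto& item : list) {
        if (null_safe_equal(v, item)) return true;
    }
    return false;
}

// IN with NULL entries dropped, which is all a non-null-aware pushdown can express.
bool in_ignoring_nulls(const NullableInt& v, const std::vector<NullableInt>& list) {
    if (v.is_null) return false;
    for (const auto& item : list) {
        if (!item.is_null && item.value == v.value) return true;
    }
    return false;
}

// Returns the 1-based ids of the rows whose age passes the chosen filter.
std::vector<int> matching_ids(const std::vector<NullableInt>& ages,
                              const std::vector<NullableInt>& filter, bool null_aware) {
    std::vector<int> ids;
    for (std::size_t i = 0; i < ages.size(); ++i) {
        bool keep = null_aware ? null_aware_in(ages[i], filter)
                               : in_ignoring_nulls(ages[i], filter);
        if (keep) ids.push_back(static_cast<int>(i) + 1);
    }
    return ids;
}
```

With the ages `(NULL, 18, NULL, 25)` and the filter `(NULL, 1000000)`, the null-aware version keeps rows 1 and 3, while the version that drops NULL keeps nothing, which matches the empty result described above.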
…cate (apache#43255)

Problem Summary:

This PR addresses a limitation in Apache Doris where only predicates joined by AND are pushed down to the ORC reader, leaving OR-connected predicates unoptimized. Extending pushdown to handle OR conditions makes better use of ORC's predicate pushdown capabilities, reducing data reads and improving query performance.
…4615)

In the old logic, the `check_expr_can_push_down` function did not check whether each `orc::Literal` was constructed successfully; that check happened only during `build_search_argument`. However, if an `orc::Literal` fails to be constructed after `builder->startNot` has been called, the build fails because the builder can no longer close the `startNot`. Therefore, the construction of `orc::Literal` is moved forward into `check_expr_can_push_down` and the result is saved in a map, so that the `build_search_argument` phase can never fail.

Related PR: apache#43255
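The two-phase idea in this commit message can be sketched as below. This is a simplified model under stated assumptions, not the Doris implementation: `NotEqualsExpr`, `make_literal`, `LiteralCache`, `check_can_push_down`, and `build` are hypothetical, the literal is a plain integer rather than an `orc::Literal`, and the builder is replaced by a list of recorded call names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical expression of the form NOT (column = literal_text).
struct NotEqualsExpr {
    int id;
    std::string column;
    std::string literal_text;
};

// Hypothetical literal conversion; fails for anything that is not a short plain integer.
bool make_literal(const std::string& text, int64_t* out) {
    if (text.empty() || text.size() > 18) return false;
    std::size_t i = (text[0] == '-') ? 1 : 0;
    if (i == text.size()) return false;
    for (; i < text.size(); ++i) {
        if (text[i] < '0' || text[i] > '9') return false;
    }
    *out = std::stoll(text);
    return true;
}

// Literals converted during the check phase, keyed by expression id.
using LiteralCache = std::unordered_map<int, int64_t>;

// Phase 1: decide whether the expression can be pushed down and cache its literal.
bool check_can_push_down(const NotEqualsExpr& e, LiteralCache& cache) {
    int64_t literal = 0;
    if (!make_literal(e.literal_text, &literal)) return false;
    cache[e.id] = literal;
    return true;
}

// Phase 2: record the builder calls. It only reads the cache, so no conversion
// can fail between "startNot" and "end".
void build(const NotEqualsExpr& e, const LiteralCache& cache, std::vector<std::string>& calls) {
    calls.push_back("startNot");
    calls.push_back("equals(" + e.column + ", " + std::to_string(cache.at(e.id)) + ")");
    calls.push_back("end");
}
```

The design point is that every fallible step runs before the first builder call, so an expression that fails the check is never half-emitted.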
…ins (apache#45104)

Related PR: apache#43255

Problem Summary:

Null values should be ignored when the literals of an in_predicate contain a null value, as in `in (1, null)`.

For example, create the table in Hive:

```sql
CREATE TABLE sample_orc_table (
    id INT,
    name STRING,
    age INT
) STORED AS ORC;

INSERT INTO TABLE sample_orc_table VALUES (1, 'Alice', 25), (2, NULL, NULL);
```

The query results in Doris should be:

```sql
mysql> select * from sample_orc_table where age in (null,25);
+------+-------+------+
| id   | name  | age  |
+------+-------+------+
|    1 | Alice |   25 |
+------+-------+------+
1 row in set (0.30 sec)

mysql> select * from sample_orc_table where age in (25);
+------+-------+------+
| id   | name  | age  |
+------+-------+------+
|    1 | Alice |   25 |
+------+-------+------+
1 row in set (0.27 sec)

mysql> select * from sample_orc_table where age in (null);
Empty set (0.01 sec)

mysql> select * from sample_orc_table where age is null;
+------+------+------+
| id   | name | age  |
+------+------+------+
|    2 | NULL | NULL |
+------+------+------+
1 row in set (0.11 sec)
```
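The results above follow from SQL's three-valued logic, which the following sketch models in isolation. It is an illustration only: `Tri`, `NullableInt`, `sql_in`, `where_keeps`, and `without_nulls` are hypothetical names, and nothing here calls into Doris or ORC. It shows why dropping NULL literals from a plain `IN` list does not change which rows a `WHERE` clause keeps.

```cpp
#include <cassert>
#include <vector>

// Three-valued SQL truth value.
enum class Tri { False, True, Unknown };

// Hypothetical nullable integer; `value` is meaningless when `is_null` is true.
struct NullableInt {
    bool is_null;
    int value;
};

// SQL `v IN (list)` under three-valued logic.
Tri sql_in(const NullableInt& v, const std::vector<NullableInt>& list) {
    if (v.is_null) return Tri::Unknown;
    bool saw_null = false;
    for (const auto& item : list) {
        if (item.is_null) {
            saw_null = true;
            continue;
        }
        if (item.value == v.value) return Tri::True;
    }
    // No match: a NULL literal makes the answer Unknown instead of False.
    return saw_null ? Tri::Unknown : Tri::False;
}

// A WHERE clause keeps a row only when the predicate is True.
bool where_keeps(Tri t) {
    return t == Tri::True;
}

// Returns a copy of the list with the NULL literals removed.
std::vector<NullableInt> without_nulls(const std::vector<NullableInt>& list) {
    std::vector<NullableInt> out;
    for (const auto& item : list) {
        if (!item.is_null) out.push_back(item);
    }
    return out;
}
```

A NULL literal can only turn a False into an Unknown, and `WHERE` discards both, so `age in (null, 25)` and `age in (25)` select the same rows, while `age in (null)` selects none.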
)

In the old logic, the `check_expr_can_push_down` function did not check whether each `orc::Literal` was constructed successfully; that check happened only during `build_search_argument`. However, if an `orc::Literal` fails to be constructed after `builder->startNot` has been called, the build fails because the builder can no longer close the `startNot`. Therefore, the construction of `orc::Literal` is moved forward into `check_expr_can_push_down` and the result is saved in a map, so that the `build_search_argument` phase can never fail.

Related PR: apache#43255

Conflicts:
be/src/vec/exec/format/orc/vorc_reader.cpp
be/test/vec/exec/orc_reader_test.cpp
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
This PR addresses a limitation in Apache Doris where only predicates joined by AND are pushed down to the ORC reader, leaving OR-connected predicates unoptimized. Extending pushdown to handle OR conditions makes better use of ORC's predicate pushdown capabilities, reducing data reads and improving query performance.
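As a rough illustration of why OR pushdown reduces reads, the sketch below evaluates a small predicate tree against per-stripe min/max statistics. It is a toy model under stated assumptions: `StripeStats`, `Pred`, `leaf`, `node`, and `may_match` are hypothetical, only one integer column is modelled, and NULL handling is left out. It is not the ORC search argument API or the code in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical min/max statistics of one integer column in one stripe.
struct StripeStats {
    int64_t min;
    int64_t max;
};

// Hypothetical predicate tree: leaves compare the column with a literal,
// inner nodes combine their children with AND or OR.
struct Pred {
    enum class Kind { LessThan, GreaterThan, Equals, And, Or };
    Kind kind = Kind::And;
    int64_t literal = 0;
    std::vector<Pred> children;
};

Pred leaf(Pred::Kind kind, int64_t literal) {
    Pred p;
    p.kind = kind;
    p.literal = literal;
    return p;
}

Pred node(Pred::Kind kind, std::vector<Pred> children) {
    Pred p;
    p.kind = kind;
    p.children = std::move(children);
    return p;
}

// Returns false only when the statistics prove that no row in the stripe can match.
bool may_match(const Pred& p, const StripeStats& s) {
    switch (p.kind) {
    case Pred::Kind::LessThan:
        return s.min < p.literal;
    case Pred::Kind::GreaterThan:
        return s.max > p.literal;
    case Pred::Kind::Equals:
        return s.min <= p.literal && p.literal <= s.max;
    case Pred::Kind::And:
        // A stripe is skipped as soon as one conjunct is ruled out.
        for (const auto& c : p.children) {
            if (!may_match(c, s)) return false;
        }
        return true;
    case Pred::Kind::Or:
        // A stripe is skipped only when every branch is ruled out.
        for (const auto& c : p.children) {
            if (may_match(c, s)) return true;
        }
        return false;
    }
    return true; // unknown kind: conservatively read the stripe
}
```

For example, with `x < 10 OR x > 40`, a stripe whose values lie in `[20, 30]` can be skipped, while a stripe covering `[5, 30]` must still be read. Without OR pushdown, both stripes would be read and filtered afterwards.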
Check List (For Committer)
Test
Behavior changed:
Does this need documentation?
Release note
None
Check List (For Reviewer who merge this PR)