-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[feat](Nereids): Optimize query by pushing down aggregation through join on foreign key #36035
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website. |
|
run buildall |
ddf7409 to
d4631c4
Compare
|
run buildall |
TPC-H: Total hot run time: 41103 ms |
TPC-DS: Total hot run time: 173088 ms |
ClickBench: Total hot run time: 31.15 s |
d4631c4 to
c2f6800
Compare
|
run buildall |
TPC-H: Total hot run time: 39773 ms |
TPC-DS: Total hot run time: 172670 ms |
ClickBench: Total hot run time: 31.26 s |
c2f6800 to
7fa9d2b
Compare
|
run buildall |
TPC-H: Total hot run time: 39239 ms |
TPC-DS: Total hot run time: 172819 ms |
ClickBench: Total hot run time: 30.7 s |
fe/fe-core/src/main/java/org/apache/doris/nereids/rules/RuleType.java
Outdated
Show resolved
Hide resolved
fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/PushDownAggThroughJoinByFk.java
Outdated
Show resolved
Hide resolved
fe/fe-core/src/main/java/org/apache/doris/nereids/rules/rewrite/PushDownAggThroughJoinByFk.java
Outdated
Show resolved
Hide resolved
...ore/src/test/java/org/apache/doris/nereids/rules/rewrite/PushDownAggThroughJoinByFkTest.java
Show resolved
Hide resolved
7fa9d2b to
c235383
Compare
|
run buildall |
1 similar comment
|
run buildall |
TPC-H: Total hot run time: 39697 ms |
TPC-DS: Total hot run time: 173455 ms |
ClickBench: Total hot run time: 30.65 s |
f15731d to
fb93934
Compare
|
run buildall |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
TPC-H: Total hot run time: 40274 ms |
TPC-DS: Total hot run time: 173730 ms |
ClickBench: Total hot run time: 30.69 s |
| } | ||
|
|
||
| /** | ||
| * This class flattens nested join clusters and optimizes aggregation pushdown. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently it is a customized support of rbo-level join reorder. In the future we may extend more common utils to support other scenarios, such as over-64 join reorder by heristic methods.
1. get arrow flight result schema use query id instead of instance id. 2. get arrow flight result is a sync method, need wait for data ready and return result, introduced by #36035 36667. TODO, waiting for data will block pipeline, so use a request pool to save requests waiting for data.
|
run buildall |
TPC-H: Total hot run time: 39547 ms |
TPC-DS: Total hot run time: 173561 ms |
ClickBench: Total hot run time: 31.11 s |
…in on foreign key (#36035) ## Proposed changes This PR optimizes query performance by pushing down aggregations through joins when grouped by a foreign key. This adjustment reduces data processing overhead above the join, improving both speed and resource efficiency. Transformation Example: Before Optimization: ``` Aggregation(group by fk) | Join(pk = fk) / \ pk fk ``` After Optimization: ``` Join(pk = fk) / \ pk Aggregation(group by fk) | fk ```
#37343) intro by #36035 This PR refines the LogicalJoin class by introducing robust input validation. Key improvements: * Implement precise checks for join input validity * Ensure consistency between input slots and output sets * Gracefully handle various join scenarios (left/right) These enhancements bolster query integrity and optimize join operations.
1. get arrow flight result schema use query id instead of instance id. 2. get arrow flight result is a sync method, need wait for data ready and return result, introduced by apache#36035 36667. TODO, waiting for data will block pipeline, so use a request pool to save requests waiting for data.
#37343) intro by #36035 This PR refines the LogicalJoin class by introducing robust input validation. Key improvements: * Implement precise checks for join input validity * Ensure consistency between input slots and output sets * Gracefully handle various join scenarios (left/right) These enhancements bolster query integrity and optimize join operations.
Related PR: #36035 Problem Summary: The key of the aggregation must include the primary key of the primary key table (or contain a unique key that can form a bijection with the primary key) to push the aggregation to the foreign key table. Before this pr, doris have wrong results in this situation: drop table if exists customer_test; drop table if exists store_sales_test; CREATE TABLE customer_test ( c_customer_sk INT not null , c_first_name VARCHAR(50), c_last_name VARCHAR(50) ); CREATE TABLE store_sales_test ( ss_customer_sk INT, ss_date DATE ); INSERT INTO customer_test VALUES (1, 'John', 'Smith'); INSERT INTO customer_test VALUES (2, 'John', 'Smith'); INSERT INTO store_sales_test VALUES (1, '2024-01-01'); INSERT INTO store_sales_test VALUES (2, '2024-01-01'); alter table customer_test add constraint c_pk primary key (c_customer_sk); alter table store_sales_test add constraint ss_c_fk foreign key (ss_customer_sk) references customer_test(c_customer_sk); show constraints from customer_test; show constraints from store_sales_test; SELECT DISTINCT c_last_name, c_first_name, ss_date FROM store_sales_test inner join customer_test on store_sales_test.ss_customer_sk = customer_test.c_customer_sk; set disable_nereids_rules='PUSH_DOWN_AGG_THROUGH_JOIN_ON_PKFK'; set disable_nereids_rules=''; Turn on PUSH_DOWN_AGG_THROUGH_JOIN_ON_PKFK will have different result with turn off PUSH_DOWN_AGG_THROUGH_JOIN_ON_PKFK before this pr. This is because AGG (group by c_last_name, c_first_name, ss_date) should not be pushed down below the JOIN operation. The original transform was: Agg(group by c_last_name, c_first_name, ss_date ) +--Join(c_customer_sk=ss_customer_sk) +--scan(customer_test) +--scan(store_sales_test) -> Join +--scan(customer_test) +--Agg(group by ss_customer_sk,ss_date) +--scan(store_sales_test) This is an incorrect rewrite because it is not equivalent. This pr corrects the rewrite, allowing the aggregation to be pushed down below the join only when there is a bijective relationship between the group by key from the primary table and the fields in the foreign table (a functional dependency exists from a to b, and also from b to a, then a and b have a bijective relationship). For example, Agg(group by c_customer_sk, c_first_name, ss_date ) +--Join(c_customer_sk=ss_customer_sk) +--scan(customer_test) +--scan(store_sales_test) -> Join(c_customer_sk=ss_customer_sk) +--scan(customer_test) +--Agg(group by ss_customer_sk,ss_date) +--scan(store_sales_test) Since c_customer_sk is the primary key, c_first_name in the group by clause can be removed (based on functional dependencies). Furthermore, due to the equality relationship c_customer_sk = ss_customer_sk, there is a bijective relationship between c_customer_sk and ss_customer_sk. In this case, `group by c_customer_sk, ss_date` can be replaced with `group by ss_customer_sk, ss_date`. The aggregation group by key is entirely replaced with the output of the foreign table. Since a primary key-foreign key join does not expand the rows of the foreign table,In this situation, the aggregation can be pushed down.
Related PR: #36035 Problem Summary: The key of the aggregation must include the primary key of the primary key table (or contain a unique key that can form a bijection with the primary key) to push the aggregation to the foreign key table. Before this pr, doris have wrong results in this situation: drop table if exists customer_test; drop table if exists store_sales_test; CREATE TABLE customer_test ( c_customer_sk INT not null , c_first_name VARCHAR(50), c_last_name VARCHAR(50) ); CREATE TABLE store_sales_test ( ss_customer_sk INT, ss_date DATE ); INSERT INTO customer_test VALUES (1, 'John', 'Smith'); INSERT INTO customer_test VALUES (2, 'John', 'Smith'); INSERT INTO store_sales_test VALUES (1, '2024-01-01'); INSERT INTO store_sales_test VALUES (2, '2024-01-01'); alter table customer_test add constraint c_pk primary key (c_customer_sk); alter table store_sales_test add constraint ss_c_fk foreign key (ss_customer_sk) references customer_test(c_customer_sk); show constraints from customer_test; show constraints from store_sales_test; SELECT DISTINCT c_last_name, c_first_name, ss_date FROM store_sales_test inner join customer_test on store_sales_test.ss_customer_sk = customer_test.c_customer_sk; set disable_nereids_rules='PUSH_DOWN_AGG_THROUGH_JOIN_ON_PKFK'; set disable_nereids_rules=''; Turn on PUSH_DOWN_AGG_THROUGH_JOIN_ON_PKFK will have different result with turn off PUSH_DOWN_AGG_THROUGH_JOIN_ON_PKFK before this pr. This is because AGG (group by c_last_name, c_first_name, ss_date) should not be pushed down below the JOIN operation. The original transform was: Agg(group by c_last_name, c_first_name, ss_date ) +--Join(c_customer_sk=ss_customer_sk) +--scan(customer_test) +--scan(store_sales_test) -> Join +--scan(customer_test) +--Agg(group by ss_customer_sk,ss_date) +--scan(store_sales_test) This is an incorrect rewrite because it is not equivalent. This pr corrects the rewrite, allowing the aggregation to be pushed down below the join only when there is a bijective relationship between the group by key from the primary table and the fields in the foreign table (a functional dependency exists from a to b, and also from b to a, then a and b have a bijective relationship). For example, Agg(group by c_customer_sk, c_first_name, ss_date ) +--Join(c_customer_sk=ss_customer_sk) +--scan(customer_test) +--scan(store_sales_test) -> Join(c_customer_sk=ss_customer_sk) +--scan(customer_test) +--Agg(group by ss_customer_sk,ss_date) +--scan(store_sales_test) Since c_customer_sk is the primary key, c_first_name in the group by clause can be removed (based on functional dependencies). Furthermore, due to the equality relationship c_customer_sk = ss_customer_sk, there is a bijective relationship between c_customer_sk and ss_customer_sk. In this case, `group by c_customer_sk, ss_date` can be replaced with `group by ss_customer_sk, ss_date`. The aggregation group by key is entirely replaced with the output of the foreign table. Since a primary key-foreign key join does not expand the rows of the foreign table,In this situation, the aggregation can be pushed down.
…#59498) Related PR: apache#36035 Problem Summary: The key of the aggregation must include the primary key of the primary key table (or contain a unique key that can form a bijection with the primary key) to push the aggregation to the foreign key table. Before this pr, doris have wrong results in this situation: drop table if exists customer_test; drop table if exists store_sales_test; CREATE TABLE customer_test ( c_customer_sk INT not null , c_first_name VARCHAR(50), c_last_name VARCHAR(50) ); CREATE TABLE store_sales_test ( ss_customer_sk INT, ss_date DATE ); INSERT INTO customer_test VALUES (1, 'John', 'Smith'); INSERT INTO customer_test VALUES (2, 'John', 'Smith'); INSERT INTO store_sales_test VALUES (1, '2024-01-01'); INSERT INTO store_sales_test VALUES (2, '2024-01-01'); alter table customer_test add constraint c_pk primary key (c_customer_sk); alter table store_sales_test add constraint ss_c_fk foreign key (ss_customer_sk) references customer_test(c_customer_sk); show constraints from customer_test; show constraints from store_sales_test; SELECT DISTINCT c_last_name, c_first_name, ss_date FROM store_sales_test inner join customer_test on store_sales_test.ss_customer_sk = customer_test.c_customer_sk; set disable_nereids_rules='PUSH_DOWN_AGG_THROUGH_JOIN_ON_PKFK'; set disable_nereids_rules=''; Turn on PUSH_DOWN_AGG_THROUGH_JOIN_ON_PKFK will have different result with turn off PUSH_DOWN_AGG_THROUGH_JOIN_ON_PKFK before this pr. This is because AGG (group by c_last_name, c_first_name, ss_date) should not be pushed down below the JOIN operation. The original transform was: Agg(group by c_last_name, c_first_name, ss_date ) +--Join(c_customer_sk=ss_customer_sk) +--scan(customer_test) +--scan(store_sales_test) -> Join +--scan(customer_test) +--Agg(group by ss_customer_sk,ss_date) +--scan(store_sales_test) This is an incorrect rewrite because it is not equivalent. This pr corrects the rewrite, allowing the aggregation to be pushed down below the join only when there is a bijective relationship between the group by key from the primary table and the fields in the foreign table (a functional dependency exists from a to b, and also from b to a, then a and b have a bijective relationship). For example, Agg(group by c_customer_sk, c_first_name, ss_date ) +--Join(c_customer_sk=ss_customer_sk) +--scan(customer_test) +--scan(store_sales_test) -> Join(c_customer_sk=ss_customer_sk) +--scan(customer_test) +--Agg(group by ss_customer_sk,ss_date) +--scan(store_sales_test) Since c_customer_sk is the primary key, c_first_name in the group by clause can be removed (based on functional dependencies). Furthermore, due to the equality relationship c_customer_sk = ss_customer_sk, there is a bijective relationship between c_customer_sk and ss_customer_sk. In this case, `group by c_customer_sk, ss_date` can be replaced with `group by ss_customer_sk, ss_date`. The aggregation group by key is entirely replaced with the output of the foreign table. Since a primary key-foreign key join does not expand the rows of the foreign table,In this situation, the aggregation can be pushed down.
Proposed changes
This PR optimizes query performance by pushing down aggregations through joins when grouped by a foreign key. This adjustment reduces data processing overhead above the join, improving both speed and resource efficiency.
Transformation Example:
Before Optimization:
After Optimization: