-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[fix](regr) Use Youngs-Cramer for REGR_SLOPE/INTERCEPT to align with PG #55940
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
| -- !regr_intercept_int -- | ||
| 1000001.0 | ||
| -9.999E9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to
- https://docs.snowflake.com/en/sql-reference/functions/regr_intercept
- https://docs.snowflake.com/en/sql-reference/functions/regr_slope
REGR_INTERCEPT(y, x) = AVG(y) - REGR_SLOPE(y, x) * AVG(x)
REGR_SLOPE(y, x) = COVAR_POP(x, y) / VAR_POP(x)Verification
select regr_intercept(col_int, col_bigint) from d_table;PostgreSQL
select avg(col_int) - regr_slope(col_int, col_bigint) * avg(col_bigint) from d_table;
+---------------+
| ?column? |
|---------------|
| -9999000000.0 |
+---------------+Clickhouse (stable covariance/variance)
SELECT avg(col_int) - ((covarPopStable(col_bigint, col_int) / varPopStable(col_bigint)) * avg(col_bigint))
FROM d_table
┌─minus(avg(co⋯l_bigint)))─┐
1. │ -9999000000 │ -- -10.00 billion
└──────────────────────────┘
| -- !regr_slope_int -- | ||
| -0.0 | ||
| 1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verification
PostgreSQL
select regr_slope(col_int, col_bigint) from d_table;
+------------+
| regr_slope |
|------------|
| 1.0 |
+------------+
select covar_pop(col_bigint, col_int) / var_pop(col_bigint) from d_table;
+----------+
| ?column? |
|----------|
| 1.0 |
+----------+Clickhouse (stable covariance/variance)
SELECT covarPopStable(col_bigint, col_int) / varPopStable(col_bigint)
FROM d_table
┌─divide(covar⋯ol_bigint))─┐
1. │ 1 │
└──────────────────────────┘| -- !regr_slope_largeint -- | ||
| 17725.127617654194 | ||
| 0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verification
PostgreSQL
select regr_slope(col_largeint, col_float) from d_table;
+------------+
| regr_slope |
|------------|
| 0.0 |
+------------+
select covar_pop(col_float, col_largeint) / var_pop(col_float) from d_table;
+----------+
| ?column? |
|----------|
| 0.0 |
+----------+Clickhouse (stable covariance/variance)
SELECT covarPopStable(col_float, col_largeint) / varPopStable(col_float)
FROM d_table
┌─divide(covar⋯col_float))─┐
1. │ 0 │
└──────────────────────────┘|
run buildall |
TPC-H: Total hot run time: 34635 ms |
TPC-DS: Total hot run time: 188040 ms |
ClickBench: Total hot run time: 30.12 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
Thanks for your contribution. we are busy in these few days. When I have time I will prioritize reviewing it. |
Got it, thanks! |
|
run buildall |
TPC-DS: Total hot run time: 182964 ms |
ClickBench: Total hot run time: 27.35 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
TPC-H: Total hot run time: 36219 ms |
TPC-DS: Total hot run time: 179217 ms |
ClickBench: Total hot run time: 27.26 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
TPC-H: Total hot run time: 35694 ms |
TPC-DS: Total hot run time: 181004 ms |
ClickBench: Total hot run time: 27.3 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
TPC-H: Total hot run time: 35067 ms |
TPC-DS: Total hot run time: 181646 ms |
ClickBench: Total hot run time: 27.47 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
I’m seeing a CI failure that seems related to an OSS download in the pipeline. Error snippet: I tried updating the branch several times, but the CI still fails with the same 404 error. It looks like the OSS object Any guidance would be appreciated. Thanks! |
linrrzqqq
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
PR approved by anyone and no changes requested. |
I will look into it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
zclllyybb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
PR approved by at least one committer and no changes requested. |
…PG (#55940) This PR reimplements `REGR_SLOPE` and `REGR_INTERCEPT` using the Youngs–Cramer algorithm to align with PostgreSQL. It also extends `AggregateFunctionRegrData<T>` so it can be reused by all `REGR_*` functions (`SXX`, `SYY`, `SXY`, `R2`, etc.). ```sql -- Copy from `regression-test/suites/query_p0/aggregate/support_type/regr_slope/regr_slope.groovy` -- dataset (PostgreSQL) drop table if exists d_table; create table d_table ( k1 int, k2 int not null, k3 bigint, col_tinyint smallint, col_smallint smallint, col_int int, col_bigint bigint, col_largeint numeric(38,0), col_float real, col_double double precision ); insert into d_table values (1, 1, 1, 100, 10000, 1000000, 10000000000, 100000000000000000000, 3.14, 2.718281828), (2, 2, 2, 101, 10001, 1000001, 10000000001, 100000000000000000001, 6.28, 3.141592653), (3, 3, 3, 102, 10002, 1000002, 10000000002, 100000000000000000002, 9.42, 1.618033988); select regr_slope(col_tinyint, col_smallint) from d_table; -- 1.0 select regr_slope(col_smallint, col_int) from d_table; -- 1.0 select regr_slope(col_int, col_bigint) from d_table; -- 1.0 select regr_slope(col_bigint, col_largeint) from d_table; -- <null> select regr_slope(col_largeint, col_float) from d_table; -- 0.0 select regr_slope(col_float, col_double) from d_table; -- -2.7928921351549283 select regr_slope(col_double, col_tinyint) from d_table; -- -0.5501239200000003 select regr_intercept(col_tinyint, col_smallint) from d_table; -- -9900.0 select regr_intercept(col_smallint, col_int) from d_table; -- -990000.0 select regr_intercept(col_int, col_bigint) from d_table; -- -9999000000.0 select regr_intercept(col_bigint, col_largeint) from d_table; -- <null> select regr_intercept(col_largeint, col_float) from d_table; -- 1e+20 select regr_intercept(col_float, col_double) from d_table; -- 13.241664047161668 select regr_intercept(col_double, col_tinyint) from d_table; -- 58.055152076333364 ```
…PG (#55940) This PR reimplements `REGR_SLOPE` and `REGR_INTERCEPT` using the Youngs–Cramer algorithm to align with PostgreSQL. It also extends `AggregateFunctionRegrData<T>` so it can be reused by all `REGR_*` functions (`SXX`, `SYY`, `SXY`, `R2`, etc.). ```sql -- Copy from `regression-test/suites/query_p0/aggregate/support_type/regr_slope/regr_slope.groovy` -- dataset (PostgreSQL) drop table if exists d_table; create table d_table ( k1 int, k2 int not null, k3 bigint, col_tinyint smallint, col_smallint smallint, col_int int, col_bigint bigint, col_largeint numeric(38,0), col_float real, col_double double precision ); insert into d_table values (1, 1, 1, 100, 10000, 1000000, 10000000000, 100000000000000000000, 3.14, 2.718281828), (2, 2, 2, 101, 10001, 1000001, 10000000001, 100000000000000000001, 6.28, 3.141592653), (3, 3, 3, 102, 10002, 1000002, 10000000002, 100000000000000000002, 9.42, 1.618033988); select regr_slope(col_tinyint, col_smallint) from d_table; -- 1.0 select regr_slope(col_smallint, col_int) from d_table; -- 1.0 select regr_slope(col_int, col_bigint) from d_table; -- 1.0 select regr_slope(col_bigint, col_largeint) from d_table; -- <null> select regr_slope(col_largeint, col_float) from d_table; -- 0.0 select regr_slope(col_float, col_double) from d_table; -- -2.7928921351549283 select regr_slope(col_double, col_tinyint) from d_table; -- -0.5501239200000003 select regr_intercept(col_tinyint, col_smallint) from d_table; -- -9900.0 select regr_intercept(col_smallint, col_int) from d_table; -- -990000.0 select regr_intercept(col_int, col_bigint) from d_table; -- -9999000000.0 select regr_intercept(col_bigint, col_largeint) from d_table; -- <null> select regr_intercept(col_largeint, col_float) from d_table; -- 1e+20 select regr_intercept(col_float, col_double) from d_table; -- 13.241664047161668 select regr_intercept(col_double, col_tinyint) from d_table; -- 58.055152076333364 ```
…PG (apache#55940) This PR reimplements `REGR_SLOPE` and `REGR_INTERCEPT` using the Youngs–Cramer algorithm to align with PostgreSQL. It also extends `AggregateFunctionRegrData<T>` so it can be reused by all `REGR_*` functions (`SXX`, `SYY`, `SXY`, `R2`, etc.). ```sql -- Copy from `regression-test/suites/query_p0/aggregate/support_type/regr_slope/regr_slope.groovy` -- dataset (PostgreSQL) drop table if exists d_table; create table d_table ( k1 int, k2 int not null, k3 bigint, col_tinyint smallint, col_smallint smallint, col_int int, col_bigint bigint, col_largeint numeric(38,0), col_float real, col_double double precision ); insert into d_table values (1, 1, 1, 100, 10000, 1000000, 10000000000, 100000000000000000000, 3.14, 2.718281828), (2, 2, 2, 101, 10001, 1000001, 10000000001, 100000000000000000001, 6.28, 3.141592653), (3, 3, 3, 102, 10002, 1000002, 10000000002, 100000000000000000002, 9.42, 1.618033988); select regr_slope(col_tinyint, col_smallint) from d_table; -- 1.0 select regr_slope(col_smallint, col_int) from d_table; -- 1.0 select regr_slope(col_int, col_bigint) from d_table; -- 1.0 select regr_slope(col_bigint, col_largeint) from d_table; -- <null> select regr_slope(col_largeint, col_float) from d_table; -- 0.0 select regr_slope(col_float, col_double) from d_table; -- -2.7928921351549283 select regr_slope(col_double, col_tinyint) from d_table; -- -0.5501239200000003 select regr_intercept(col_tinyint, col_smallint) from d_table; -- -9900.0 select regr_intercept(col_smallint, col_int) from d_table; -- -990000.0 select regr_intercept(col_int, col_bigint) from d_table; -- -9999000000.0 select regr_intercept(col_bigint, col_largeint) from d_table; -- <null> select regr_intercept(col_largeint, col_float) from d_table; -- 1e+20 select regr_intercept(col_float, col_double) from d_table; -- 13.241664047161668 select regr_intercept(col_double, col_tinyint) from d_table; -- 58.055152076333364 ```
…onRegrData (#59224) ### What problem does this PR solve? Issue Number: close #38977 Problem Summary: This PR migrates regr_sxx/syy/sxy onto the shared Moment(AggregateFunctionRegrData) introduced in #55940. The original implementation and tests were done in #39187 by @wyxxxcat. This PR builds on top of that work, refactoring it to reuse the same state and merge logic. --------- Co-authored-by: wyxxxcat <1520358997@qq.com>
What problem does this PR solve?
Issue Number: #38975
Problem Summary:
This PR reimplements
REGR_SLOPEandREGR_INTERCEPTusing the Youngs–Cramer algorithm to align with PostgreSQL.It also extends
AggregateFunctionRegrData<T>so it can be reused by allREGR_*functions (SXX,SYY,SXY,R2, etc.).Next step:
Refactor
REGR_SXX/REGR_SYY/REGR_SXY/REGR_R2to reuse the sameAggregateFunctionRegrData<T>state(n, sx, sy, sxx, syy, sxy)and Youngs–Cramer merge. This ensures consistent numerics and associative merges across allREGR_*functions, and aligns results with PostgreSQL.regr_sxx = sxxregr_syy = syyregr_sxy = sxyregr_r2 = (sxy * sxy) / (sxx * syy)Release note
None
Check List (For Author)
Test
Behavior changed:
REGR_SLOPE/REGR_INTERCEPTnow follow PostgreSQL / SQL:2003 semantics:slope = sxy / sxx,intercept = (sy - sx * sxy/sxx) / n.Does this need documentation?
Check List (For Reviewer who merge this PR)