Conversation

@xinyiZzz
Contributor

@xinyiZzz xinyiZzz commented Nov 26, 2020

Update on 2020-11-30:
Replaced with calculating the variance using Decimal as the intermediate variable, which is a more elegant approach. See below for details.
The earlier trick of casting to double can be seen at: xinyiZzz@1f17248
——————————————————————————————————————————

Proposed changes

By casting Decimal to Double, this change implements variance calculation for Decimal-type columns.
In my local Spark 2.4 and MySQL, the variance of a Decimal is also calculated by casting it to Double.
NOTE: There may be two sources of precision loss (a sketch of the first follows this list):

  • Casting Decimal to Double
  • During the variance calculation itself, the Double intermediate variables lose precision (a Double carries roughly 16 significant decimal digits)
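As a minimal illustration of the first source of loss (the value below is an arbitrary example, not taken from this PR), casting a Decimal with more than 16 significant digits to double already discards the low-order digits:

from decimal import Decimal

# A DECIMAL(27,9)-style value with more than 16 significant digits.
d = Decimal("123456789012345678.123456789")

# A double keeps only ~16 significant decimal digits, so the cast
# rounds away the tail before any variance math runs.
print(float(d))               # 1.2345678901234568e+17
print(Decimal(float(d)) - d)  # the rounding error introduced by the cast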

Types of changes

What types of changes does your code introduce to Doris?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)

Checklist

Further comments

Before fixing the variance calculation by casting Decimal to Double, I implemented a method that calculates the variance directly on Decimal, using DecimalV2Value as the type of the intermediate variable. See commit: xinyiZzz@0288a6f

Compared with casting to Double, calculating the variance directly on Decimal has the following advantages (a sketch of this method follows the overflow note below):

  • Less precision loss.
  • Faster performance.
  • Consistency with other functions such as avg() and sum(), which Doris computes directly on Decimal, using DecimalV2Value as the intermediate type.

However, calculating the variance directly on Decimal has an overflow problem. When variance > 9223372036854775807999999999 / N (9223372036854775807999999999 is the maximum value of DecimalV2Value; N is the number of data rows), the intermediate variable in the calculation overflows and returns the maximum value.
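A minimal sketch of the direct method in Python, using decimal.Decimal as a stand-in for DecimalV2Value (the constant and the overflow check mirror the description above but are illustrative, not Doris's exact internal representation):

from decimal import Decimal

# Maximum DecimalV2Value cited above, treated here as a plain bound.
DECIMALV2_MAX = Decimal("9223372036854775807999999999")

def decimal_variance(values):
    # Population variance accumulated entirely in Decimal.
    n = len(values)
    mean = sum(values, Decimal(0)) / n
    m2 = Decimal(0)  # running sum of squared deviations
    for v in values:
        m2 += (v - mean) ** 2
        # The overflow condition above: variance > MAX / N is the
        # same as m2 = variance * N exceeding MAX.
        if m2 > DECIMALV2_MAX:
            raise OverflowError("intermediate exceeds DecimalV2Value range")
    return m2 / n

For example, decimal_variance([Decimal("1"), Decimal("2"), Decimal("3")]) returns 2/3 under Python's default 28-digit Decimal context, without ever leaving Decimal arithmetic.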

I feel that real business workloads are unlikely to compute a variance greater than 9223372036854775807999999999 / N, so casting to double means lower accuracy and worse performance; but since both Spark and MySQL use the cast-to-double method, I submitted this code.

Precision comparison:

  • When the first 16 digits of the data are the same, e.g. 12345678901234560 and 12345678901234561, the result after casting to double is 0 (precision lost), while direct calculation on Decimal gives the correct result, 0.5 (reproduced in the sketch below).
  • When the variance is large, the directly calculated Decimal result 42981942961.913988699 is more accurate than the cast-to-double result 42981942961.912369.
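The first bullet can be reproduced in a few lines of Python, whose float is an IEEE-754 double (statistics.variance is the sample variance, matching the 0.5 above):

from decimal import Decimal
from statistics import variance

a, b = 12345678901234560, 12345678901234561

# Both values collapse to the same double: a double carries only
# ~16 significant decimal digits, so the final digit is lost.
print(float(a) == float(b))                # True

# Sample variance via double arithmetic: 0.0 (precision lost).
print(variance([float(a), float(b)]))      # 0.0

# Sample variance computed directly on Decimal: the correct 0.5.
print(variance([Decimal(a), Decimal(b)]))  # 0.5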

Performance comparison:

The test results for variance/stddev and related functions are as follows:

  • Cast to Double calculation: 0.054s
  • Decimal direct calculation: 0.034s

Table creation:

CREATE TABLE stddev_samp_test29
(event_day int, money DECIMAL(27,9) DEFAULT "0")
DISTRIBUTED BY HASH(event_day) BUCKETS 5
PROPERTIES("replication_num" = "1");

100,000 rows of equal-length data (such as 99.00066) are generated by:

import random

f = open('data.txt', 'w')
n = 100000
while n:
    f.write('0,' + "%.5f" % (99 + random.random()) + '\n')
    n -= 1
f.close()

Doing the preliminary research first is so important, so sad...

@EmmyMiao87 EmmyMiao87 added the area/sql/function Issues or PRs related to the SQL functions label Nov 26, 2020
EmmyMiao87
EmmyMiao87 previously approved these changes Nov 26, 2020
Contributor

@EmmyMiao87 EmmyMiao87 left a comment


LGTM

@morningman
Contributor

I think this PR is too tricky.
xinyiZzz@0288a6f
This seems better to me.

@xinyiZzz
Contributor Author

xinyiZzz commented Nov 29, 2020

I think this PR is too tricky.
xinyiZzz@0288a6f
This seems better to me.

I am not sure whether being inconsistent with Spark would cause problems later.
Maybe I can do it in a more elegant way, performing the cast to double inside BE's function; that would be similar to the second method.
But that would also change more places. The first trick above is the smallest change I found.

…andard deviation, and use DecimalValue to store intermediate results
@xinyiZzz xinyiZzz force-pushed the decimal_variance_stddev_cast_double branch from e284910 to 91ca898 Compare December 1, 2020 03:45
Contributor

@morningman morningman left a comment


LGTM

@morningman morningman added the approved Indicates a PR has been approved by one committer. label Dec 5, 2020
@morningman morningman merged commit b1b99ae into apache:master Dec 6, 2020
@yangzhg yangzhg mentioned this pull request Feb 9, 2021


Successfully merging this pull request may close these issues.

The STDDEV_SAMP function's results deviate significantly from MySQL's
