Conversation

@xinyiZzz
Contributor

@xinyiZzz xinyiZzz commented Nov 26, 2020

Update on 2020-11-30:
Replaced with calculating the variance using Decimal as the intermediate variable, which is a more elegant approach. See below for details.
The earlier trick of casting to double can be seen at: xinyiZzz@1f17248
——————————————————————————————————————————

Proposed changes

By casting Decimal to Double, this change implements variance calculation for Decimal-type columns.
In my local Spark 2.4 and MySQL, the variance of a Decimal is also calculated by casting it to Double.
NOTE: There may be two sources of precision loss (a sketch of the first follows this list):

  • Casting Decimal to Double
  • During the variance calculation itself, the Double intermediate variables lose precision (a Double carries roughly 16 significant decimal digits)
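As a minimal illustration of the first source of loss (the value below is an arbitrary example, not taken from this PR), casting a Decimal with more than 16 significant digits to double already discards the low-order digits:

from decimal import Decimal

# A DECIMAL(27,9)-style value with more than 16 significant digits.
d = Decimal("123456789012345678.123456789")

# A double keeps only ~16 significant decimal digits, so the cast
# rounds away the tail before any variance math runs.
print(float(d))               # 1.2345678901234568e+17
print(Decimal(float(d)) - d)  # the rounding error introduced by the cast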

Types of changes

What types of changes does your code introduce to Doris?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)

Checklist

Further comments

Before fixing the variance calculation by casting Decimal to Double, I implemented a method that calculates the variance directly on Decimal, using DecimalV2Value as the type of the intermediate variable. See commit: xinyiZzz@0288a6f

Compared with casting to Double, calculating the variance directly on Decimal has the following advantages (a sketch of this method follows the overflow note below):

  • Less precision loss.
  • Faster performance.
  • Consistency with other functions such as avg() and sum(), which Doris computes directly on Decimal, using DecimalV2Value as the intermediate type.

However, calculating the variance directly on Decimal has an overflow problem. When variance > 9223372036854775807999999999 / N (9223372036854775807999999999 is the maximum value of DecimalV2Value; N is the number of data rows), the intermediate variable in the calculation overflows and returns the maximum value.
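A minimal sketch of the direct method in Python, using decimal.Decimal as a stand-in for DecimalV2Value (the constant and the overflow check mirror the description above but are illustrative, not Doris's exact internal representation):

from decimal import Decimal

# Maximum DecimalV2Value cited above, treated here as a plain bound.
DECIMALV2_MAX = Decimal("9223372036854775807999999999")

def decimal_variance(values):
    # Population variance accumulated entirely in Decimal.
    n = len(values)
    mean = sum(values, Decimal(0)) / n
    m2 = Decimal(0)  # running sum of squared deviations
    for v in values:
        m2 += (v - mean) ** 2
        # The overflow condition above: variance > MAX / N is the
        # same as m2 = variance * N exceeding MAX.
        if m2 > DECIMALV2_MAX:
            raise OverflowError("intermediate exceeds DecimalV2Value range")
    return m2 / n

For example, decimal_variance([Decimal("1"), Decimal("2"), Decimal("3")]) returns 2/3 under Python's default 28-digit Decimal context, without ever leaving Decimal arithmetic.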

I feel that real business workloads are unlikely to compute a variance greater than 9223372036854775807999999999 / N, so casting to double means lower accuracy and worse performance; but since both Spark and MySQL use the cast-to-double method, I submitted this code.

Precision comparison:

  • When the first 16 digits of the data are the same, e.g. 12345678901234560 and 12345678901234561, the result after casting to double is 0 (precision lost), while direct calculation on Decimal gives the correct result, 0.5 (reproduced in the sketch below).
  • When the variance is large, the directly calculated Decimal result 42981942961.913988699 is more accurate than the cast-to-double result 42981942961.912369.
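The first bullet can be reproduced in a few lines of Python, whose float is an IEEE-754 double (statistics.variance is the sample variance, matching the 0.5 above):

from decimal import Decimal
from statistics import variance

a, b = 12345678901234560, 12345678901234561

# Both values collapse to the same double: a double carries only
# ~16 significant decimal digits, so the final digit is lost.
print(float(a) == float(b))                # True

# Sample variance via double arithmetic: 0.0 (precision lost).
print(variance([float(a), float(b)]))      # 0.0

# Sample variance computed directly on Decimal: the correct 0.5.
print(variance([Decimal(a), Decimal(b)]))  # 0.5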

Performance comparison:

The test results for variance/stddev and related functions are as follows:

  • Cast to Double calculation: 0.054s
  • Decimal direct calculation: 0.034s

Table creation:

CREATE TABLE stddev_samp_test29
(event_day int, money DECIMAL(27,9) DEFAULT "0")
DISTRIBUTED BY HASH(event_day) BUCKETS 5
PROPERTIES("replication_num" = "1");

100,000 rows of equal-length data (such as 99.00066) are generated by:

import random

f = open('data.txt', 'w')
n = 100000
while n:
    f.write('0,' + "%.5f" % (99 + random.random()) + '\n')
    n -= 1
f.close()

Doing the preliminary research first is so important, so sad...

@EmmyMiao87 EmmyMiao87 added the area/sql/function Issues or PRs related to the SQL functions label Nov 26, 2020
EmmyMiao87
EmmyMiao87 previously approved these changes Nov 26, 2020
Contributor

@EmmyMiao87 EmmyMiao87 left a comment


LGTM

@morningman
Contributor

I think this PR is too tricky.
xinyiZzz@0288a6f
This seems better to me.

@xinyiZzz
Contributor Author

xinyiZzz commented Nov 29, 2020

I think this PR is too tricky.
xinyiZzz@0288a6f
This seems better to me.

I am not sure whether being inconsistent with Spark would cause problems later.
Maybe I can do it in a more elegant way, performing the cast to double inside BE's function; that would be similar to the second method.
But that would also change more places. The first trick above is the smallest change I found.

…andard deviation, and use DecimalValue to store intermediate results
@xinyiZzz xinyiZzz force-pushed the decimal_variance_stddev_cast_double branch from e284910 to 91ca898 Compare December 1, 2020 03:45
Contributor

@morningman morningman left a comment


LGTM

@morningman morningman added the approved Indicates a PR has been approved by one committer. label Dec 5, 2020
@morningman morningman merged commit b1b99ae into apache:master Dec 6, 2020
@yangzhg yangzhg mentioned this pull request Feb 9, 2021


Successfully merging this pull request may close these issues.

The STDDEV_SAMP function's results deviate significantly from MySQL's
