[Bug] Fix the loss of precision when Decimal calculates variance/stddev #4959
Update on 2020-11-30:
Replaced by calculating the variance with Decimal as the intermediate type, which is a more elegant approach. See below for details.
The earlier workaround of casting to double can be seen at xinyiZzz@1f17248.
---
Proposed changes
By casting Decimal to Double, this change implements variance calculation for Decimal-type columns.
In my local tests, Spark 2.4 and MySQL also calculate the variance of a Decimal column by casting to Double.
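Purely as an illustration of the cast-to-double idea, and not the actual Doris implementation, here is a minimal Java sketch; the class and method names are made up, and BigDecimal stands in for the Decimal column values:

```java
import java.math.BigDecimal;
import java.util.List;

public class VarianceByDouble {

    // Naive one-pass population variance computed after casting each value
    // to double, mirroring the cast-to-double approach described above.
    static double variance(List<BigDecimal> values) {
        double sum = 0.0;
        double sumOfSquares = 0.0;
        for (BigDecimal v : values) {
            double d = v.doubleValue();   // precision can already be lost in this cast
            sum += d;
            sumOfSquares += d * d;
        }
        double n = values.size();
        double mean = sum / n;
        return sumOfSquares / n - mean * mean;
    }

    public static void main(String[] args) {
        List<BigDecimal> column = List.of(
                new BigDecimal("1.5"), new BigDecimal("2.5"), new BigDecimal("3.5"));
        System.out.println(variance(column));   // about 0.6667, the population variance
    }
}
```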
NOTE: There may be two sources of precision loss:
Types of changes
What types of changes does your code introduce to Doris? Put an `x` in the boxes that apply.
Checklist
Further comments
Before fixing the variance calculation by casting Decimal to Double, I implemented a method that calculates the variance directly on Decimal, using DecimalV2Value as the type of the intermediate variable. See commit xinyiZzz@0288a6f.
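The commit above is the real implementation; as a rough analogy only, the following sketch keeps the whole intermediate state in decimal arithmetic, with Java's BigDecimal standing in for DecimalV2Value and hypothetical class and method names:

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.util.List;

public class VarianceByDecimal {

    // One-pass population variance kept entirely in decimal arithmetic,
    // analogous to using DecimalV2Value as the intermediate type.
    static BigDecimal variance(List<BigDecimal> values) {
        BigDecimal sum = BigDecimal.ZERO;
        BigDecimal sumOfSquares = BigDecimal.ZERO;
        for (BigDecimal v : values) {
            sum = sum.add(v);
            // The squared term is where a fixed-width decimal such as
            // DecimalV2Value can overflow for large inputs; BigDecimal is
            // arbitrary precision, so that concern does not show up here.
            sumOfSquares = sumOfSquares.add(v.multiply(v));
        }
        BigDecimal n = BigDecimal.valueOf(values.size());
        BigDecimal mean = sum.divide(n, MathContext.DECIMAL128);
        return sumOfSquares.divide(n, MathContext.DECIMAL128)
                .subtract(mean.multiply(mean));
    }

    public static void main(String[] args) {
        List<BigDecimal> column = List.of(
                new BigDecimal("99.00066"), new BigDecimal("99.00067"), new BigDecimal("99.00068"));
        System.out.println(variance(column));   // about 6.67E-11
    }
}
```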
Compared with cast to Double, the considerations and advantages of calculating variance directly based on Decimal:

- avg(), sum(), etc. are calculated directly on Decimal in Doris, using DecimalV2Value as the type of the intermediate variable. However, calculating variance based on Decimal has an overflow problem: when variance > 9223372036854775807999999999/N (9223372036854775807999999999 is the maximum value of DecimalV2Value, N is the number of data rows), the intermediate variable overflows during the calculation and the maximum value is returned. In my view, real business requirements will not produce a variance greater than 9223372036854775807999999999/N, so cast to Double only brings lower accuracy and worse performance; but since both Spark and MySQL use the cast-to-double method, I submitted this code.
- Precision comparison: for the two values 12345678901234560 and 12345678901234561, the result calculated by cast to double is 0 (loss of precision), while the correct result calculated directly with Decimal is 0.5; in another case the cast-to-double result is 42981942961.912369, while the result calculated directly with Decimal, 42981942961.913988699, is more accurate (a sketch at the end of this post reproduces the first case).
- Performance comparison: the test results of variance/stddev and other functions are as follows:
Table building method:
100,000 (10w) rows of data of equal length, such as 99.00066, generated by:

Doing thorough preliminary research up front is so important... sad.
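Finally, a small sketch reproducing the first precision-comparison case above. The post does not say which formula produced the 0.5, so the sample-variance formula (divide by n - 1), which yields 0.5 for these two values, is assumed here; BigDecimal again stands in for Decimal:

```java
import java.math.BigDecimal;
import java.math.MathContext;

public class PrecisionComparison {
    public static void main(String[] args) {
        BigDecimal a = new BigDecimal("12345678901234560");
        BigDecimal b = new BigDecimal("12345678901234561");

        // Cast to double first: both inputs round to the same double, because
        // doubles near 1.2e16 are spaced 2 apart, so the computed spread is 0.
        double d0 = a.doubleValue();
        double d1 = b.doubleValue();
        double meanD = (d0 + d1) / 2;
        double varDouble = ((d0 - meanD) * (d0 - meanD) + (d1 - meanD) * (d1 - meanD)) / (2 - 1);
        System.out.println(varDouble);           // 0.0 -- the difference between a and b is lost

        // The same sample-variance formula kept in decimal arithmetic.
        BigDecimal mean = a.add(b).divide(BigDecimal.valueOf(2), MathContext.DECIMAL128);
        BigDecimal varDecimal = a.subtract(mean).pow(2)
                .add(b.subtract(mean).pow(2));   // dividing by n - 1 = 1 leaves this unchanged
        System.out.println(varDecimal);          // 0.50 -- the difference survives
    }
}
```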