[Bug] Fix row_number and group by have inconsistent partition results for (0.0, -0.0) #5226
Problem analysis
The root of the problem is how negative zero (-0.0) behaves when compared with positive zero (+0.0).
Currently, in GroupBy and HashPartition, -0.0 and 0.0 are treated as unequal because the Hash function produces different results for them, so -0.0 and 0.0 are divided into two partitions.
In the row_number analytic function, the data is sorted and a new partition is opened whenever the values of two adjacent rows are not equal. But in C++ the comparison 0.0 == -0.0 is true, so 0.0 and -0.0 fall into the same partition for row_number.
(Floating-point arithmetic in C++ is usually IEEE 754. This standard defines two different representations of zero, positive zero and negative zero, and requires that the two compare equal. See https://stackoverflow.com/questions/45795397)
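The mismatch described above can be reproduced in a few lines. This is a minimal sketch, assuming a 64-bit IEEE 754 double; bits_of and bytes_hash are hypothetical helpers written for illustration, and the FNV-1a hash is only a stand-in for a byte-wise hash, not the hash function Doris uses.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Raw bit pattern of a double (assumes a 64-bit IEEE 754 double).
inline std::uint64_t bits_of(double v) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "expects a 64-bit double");
    std::uint64_t b;
    std::memcpy(&b, &v, sizeof b);
    return b;
}

// Hypothetical byte-wise hash (FNV-1a), a stand-in for any hash that
// looks at the raw bytes of the value. Not Doris's hash function.
inline std::uint64_t bytes_hash(double v) {
    unsigned char buf[sizeof v];
    std::memcpy(buf, &v, sizeof v);
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : buf) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// Value comparison and byte-wise hashing disagree about the two zeros.
inline void check_zero_mismatch() {
    assert(0.0 == -0.0);                          // equal as values (row_number view)
    assert(bits_of(0.0) != bits_of(-0.0));        // sign bit differs
    assert(bytes_hash(0.0) != bytes_hash(-0.0));  // different hash (GroupBy view)
}
```

Any operator that compares values puts the two zeros together, while any operator that hashes the raw bytes separates them, which matches the inconsistency between row_number and GroupBy reported here.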
Fix method
1. (Deprecated) Modify the eq comparison of two Doubles in BinaryPredicate so that -0.0 == 0.0 returns false when both sides of the expr are SlotRef.
Problems:
(1) order by still treats 0.0 and -0.0 as equal, so the order of 0.0 and -0.0 in the result is random.
(2) Both sides of the BinaryPredicate have to be restricted to SlotRef because a constant cannot be defined as -0.0. For example, the SQL expression where k1 = -0.0 is actually evaluated as where k1 = 0.0.
2. Modify the Hash function in hash_util.hpp so that negative zero is rewritten to 0.0 before hashing. This function is used in DataStreamSender::send HashPartition and GroupBy. Because the original value of the data is overwritten, -0.0 in the original data becomes 0.0 after hashing.
Our goal is that -0.0 never appears in Doris and is always replaced with 0.0. Therefore, Broker Load needs to be modified in the future to change -0.0 to 0.0 when importing. MySQL and SparkSQL currently convert -0.0 to 0.0 when loading it into a Double column. One case may not be avoidable, however: in select round(-0.2,0); the returned -0.0 is meaningful, see https://www.johndcook.com/blog/2010/06/15/why-computers-have-signed-zero/
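The hash-side fix under Fix method can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in hash_util.hpp: normalize_zero and hash_in_place are hypothetical names, std::hash stands in for the real hash function, and a 64-bit IEEE 754 double is assumed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>

// Hypothetical helper: map negative zero to positive zero and leave every
// other value (including NaN, which never compares equal to 0.0) untouched.
inline double normalize_zero(double v) {
    return v == 0.0 ? 0.0 : v;
}

// Hypothetical hash entry point. It overwrites the caller's value with the
// normalized one, mirroring the description that -0.0 in the original data
// becomes 0.0 after hashing. std::hash is a stand-in for the real hash.
inline std::size_t hash_in_place(double& v) {
    v = normalize_zero(v);
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    return std::hash<std::uint64_t>{}(bits);
}

// Both zeros now land in the same bucket, and the stored value loses its sign.
inline void check_normalized_hash() {
    double pos = 0.0;
    double neg = -0.0;
    assert(hash_in_place(pos) == hash_in_place(neg));
    assert(!std::signbit(neg));
}
```

Normalizing before hashing keeps the hash consistent with the == comparison that row_number relies on, at the cost of losing the sign of zero in the stored data.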