[Bug] Fix row_number and group by have inconsistent partition results for (0.0, -0.0) #5226

xinyiZzz · 2021-01-12T10:34:05Z

Problem analysis

The essence of the problem is behavior of negative zero (- 0.0) in comparison with positive zero (+ 0.0).
Currently in GroupBy and HashPartition, -0.0 is not equal to 0.0 (result of Hash function), so the -0.0 and 0.0 are divided into 2 partitions.

In row_number analytic function, for the sorted data, a new partition will be opened when the values of the upper and lower rows are not equal. But in C++ the comparison 0.0 == -0.0 is true, so 0.0 and -0.0 are divided into the same partition for row_number.

(Floating point arithmetic in C++ is often IEEE-754. This norm defines two different representations for the value zero: positive zero and negative zero. It is also defined that those two representations must compare equals. Refer to https://stackoverflow.com/questions/45795397)

Fix method

(Deprecated) Modifies the eq comparison of two Doubles in BinaryPredicate, and returns -0.0 == 0.0 as false when both sides of expr are SlotRef.
Problems:
(1) When order by is still considered equal to 0.0 and -0.0, the order of 0.0 and -0.0 in the result is random
(2) The reason for restricting both sides of BinaryPredicate to be SlotRef is that the constant cannot be defined as -0.0. For example, the expression of where k1 = -0.0 in SQL in actual calculation is where k1 = 0.0.
Modify the Hash function in hash_util.hpp, when data is negative zero, rewrite it to 0.0, which will be used in DataStreamSender::send HashPartition and GroupBy, because the original value of data is covered, so -0.0 in the original data after Hash will be changed to 0.0.

Our goal is that -0.0 will not appear in Doris, and all will be replaced with 0.0. Therefore, you need to modify the Broker Load in the future, and change -0.0 to 0.0 when importing. Mysql and SparkSQL are currently converted to 0.0 when Load -0.0 to Double column. But there is a situation that may not be avoided, that is, select round(-0.2,0);, the returned -0.0 is meaningful, refer to https://www.johndcook.com/blog/2010/06/15/why-computers-have-signed-zero/

Types of changes

Bugfix (non-breaking change which fixes an issue)

Checklist

I have create an issue on (Fix [BUG] row_number and group by have inconsistent partition results for (0.0, -0.0) #5225), and have described the bug/feature there in detail
Compiling and unit tests pass locally with my changes

…ition

EmmyMiao87

LGTM

… -0 partition (#5226)" This reverts commit 34bfc42.

…ition (apache#5226) The essence of the problem is behavior of negative zero (- 0.0) in comparison with positive zero (+ 0.0). Currently in GroupBy and HashPartition, -0.0 is not equal to 0.0 (result of Hash function), so the -0.0 and 0.0 are divided into 2 partitions. In row_number analytic function, for the sorted data, a new partition will be opened when the values of the upper and lower rows are not equal. But in C++ the comparison 0.0 == -0.0 is true, so 0.0 and -0.0 are divided into the same partition for row_number. (Floating point arithmetic in C++ is often IEEE-754. This norm defines two different representations for the value zero: positive zero and negative zero. It is also defined that those two representations must compare equals. Refer to https://stackoverflow.com/questions/45795397)

… -0 partition (apache#5226)" This reverts commit b455a04. Change-Id: I2aa12dad664758375161adcf139f4631054a7221

… -0 partition (#5226)" (#5297) This reverts commit 34bfc42. The hash algo may be overflow

xinyiZzz force-pushed the fix_0_partition branch 2 times, most recently from ef76585 to 99841f8 Compare January 13, 2021 06:26

EmmyMiao87 added kind/fix Categorizes issue or PR as related to a bug. SQL Bug labels Jan 14, 2021

xinyiZzz force-pushed the fix_0_partition branch from 99841f8 to 304b4f7 Compare January 18, 2021 09:19

[Bug] Fix row_number and group by are inconsistent with 0 and -0 part…

ab13434

…ition

xinyiZzz force-pushed the fix_0_partition branch from 304b4f7 to ab13434 Compare January 18, 2021 10:06

EmmyMiao87 approved these changes Jan 19, 2021

View reviewed changes

EmmyMiao87 added the approved Indicates a PR has been approved by one committer. label Jan 19, 2021

morningman merged commit 34bfc42 into apache:master Jan 23, 2021

morningman added a commit that referenced this pull request Jan 25, 2021

Revert "[Bug] Fix row_number and group by are inconsistent with 0 and…

683f93d

… -0 partition (#5226)" This reverts commit 34bfc42.

morningman mentioned this pull request Jan 25, 2021

Revert "[Bug] Fix row_number and group by have inconsistent partition results for (0.0, -0.0)" #5297

Merged

morningman added a commit that referenced this pull request Jan 26, 2021

Revert "[Bug] Fix row_number and group by are inconsistent with 0 and…

c084276

… -0 partition (#5226)" (#5297) This reverts commit 34bfc42. The hash algo may be overflow

yangzhg mentioned this pull request Feb 9, 2021

Release Notes 0.14.0 #5374

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Fix row_number and group by have inconsistent partition results for (0.0, -0.0) #5226

[Bug] Fix row_number and group by have inconsistent partition results for (0.0, -0.0) #5226

Uh oh!

xinyiZzz commented Jan 12, 2021 •

edited

Loading

Uh oh!

EmmyMiao87 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[Bug] Fix row_number and group by have inconsistent partition results for (0.0, -0.0) #5226

[Bug] Fix row_number and group by have inconsistent partition results for (0.0, -0.0) #5226

Uh oh!

Conversation

xinyiZzz commented Jan 12, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem analysis

Fix method

Types of changes

Checklist

Uh oh!

EmmyMiao87 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

xinyiZzz commented Jan 12, 2021 •

edited

Loading