Skip to content

Conversation

@xinyiZzz
Copy link
Contributor

@xinyiZzz xinyiZzz commented Jan 12, 2021

Problem analysis

The essence of the problem is behavior of negative zero (- 0.0) in comparison with positive zero (+ 0.0).
Currently in GroupBy and HashPartition, -0.0 is not equal to 0.0 (result of Hash function), so the -0.0 and 0.0 are divided into 2 partitions.

In row_number analytic function, for the sorted data, a new partition will be opened when the values ​​​​of the upper and lower rows are not equal. But in C++ the comparison 0.0 == -0.0 is true, so 0.0 and -0.0 are divided into the same partition for row_number.

(Floating point arithmetic in C++ is often IEEE-754. This norm defines two different representations for the value zero: positive zero and negative zero. It is also defined that those two representations must compare equals. Refer to https://stackoverflow.com/questions/45795397)

Fix method

  1. (Deprecated) Modifies the eq comparison of two Doubles in BinaryPredicate, and returns -0.0 == 0.0 as false when both sides of expr are SlotRef.
    Problems:
    (1) When order by is still considered equal to 0.0 and -0.0, the order of 0.0 and -0.0 in the result is random
    (2) The reason for restricting both sides of BinaryPredicate to be SlotRef is that the constant cannot be defined as -0.0. For example, the expression of where k1 = -0.0 in SQL in actual calculation is where k1 = 0.0.

  2. Modify the Hash function in hash_util.hpp, when data is negative zero, rewrite it to 0.0, which will be used in DataStreamSender::send HashPartition and GroupBy, because the original value of data is covered, so -0.0 in the original data after Hash will be changed to 0.0.

Our goal is that -0.0 will not appear in Doris, and all will be replaced with 0.0. Therefore, you need to modify the Broker Load in the future, and change -0.0 to 0.0 when importing. Mysql and SparkSQL are currently converted to 0.0 when Load -0.0 to Double column. But there is a situation that may not be avoided, that is, select round(-0.2,0);, the returned -0.0 is meaningful, refer to https://www.johndcook.com/blog/2010/06/15/why-computers-have-signed-zero/

Types of changes

  • Bugfix (non-breaking change which fixes an issue)

Checklist

@xinyiZzz xinyiZzz force-pushed the fix_0_partition branch 2 times, most recently from ef76585 to 99841f8 Compare January 13, 2021 06:26
@EmmyMiao87 EmmyMiao87 added kind/fix Categorizes issue or PR as related to a bug. SQL Bug labels Jan 14, 2021
Copy link
Contributor

@EmmyMiao87 EmmyMiao87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@EmmyMiao87 EmmyMiao87 added the approved Indicates a PR has been approved by one committer. label Jan 19, 2021
@morningman morningman merged commit 34bfc42 into apache:master Jan 23, 2021
morningman added a commit that referenced this pull request Jan 25, 2021
EmmyMiao87 pushed a commit to EmmyMiao87/incubator-doris that referenced this pull request Jan 26, 2021
…ition (apache#5226)

The essence of the problem is behavior of negative zero (- 0.0) in comparison with positive zero (+ 0.0).
Currently in GroupBy and HashPartition, -0.0 is not equal to 0.0 (result of Hash function),
so the -0.0 and 0.0 are divided into 2 partitions.

In row_number analytic function, for the sorted data, a new partition will be opened when the values ​​​​of
the upper and lower rows are not equal. But in C++ the comparison 0.0 == -0.0 is true, so 0.0 and -0.0
are divided into the same partition for row_number.

(Floating point arithmetic in C++ is often IEEE-754. This norm defines two different representations for
the value zero: positive zero and negative zero. It is also defined that those two representations must
compare equals. Refer to https://stackoverflow.com/questions/45795397)
EmmyMiao87 pushed a commit to EmmyMiao87/incubator-doris that referenced this pull request Jan 26, 2021
… -0 partition (apache#5226)"

This reverts commit b455a04.

Change-Id: I2aa12dad664758375161adcf139f4631054a7221
morningman added a commit that referenced this pull request Jan 26, 2021
… -0 partition (#5226)" (#5297)

This reverts commit 34bfc42.
The hash algo may be overflow
@yangzhg yangzhg mentioned this pull request Feb 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. kind/fix Categorizes issue or PR as related to a bug.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] row_number and group by have inconsistent partition results for (0.0, -0.0)

3 participants