@pitrou (Member) commented Jun 8, 2023

The original algorithm for real-to-decimal conversion did its computations in the floating-point domain, accumulating rounding errors, especially for large scale or precision values. For example:

>>> pa.array([1234567890.]).cast(pa.decimal128(38, 11))
<pyarrow.lib.Decimal128Array object at 0x7f05f4a3f1c0>
[
  1234567889.99999995904
]
>>> pa.array([1234567890.]).cast(pa.decimal128(38, 12))
<pyarrow.lib.Decimal128Array object at 0x7f05f494f9a0>
[
  1234567890.000000057344
]

The new algorithm strives to avoid precision loss by doing all its computations in the decimal domain. However, negative scales, which are presumably infrequent, fall back on the old algorithm.
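
To make the failure mode concrete, here is a minimal Python sketch. It is not Arrow's C++ implementation; the helper names `naive_unscaled`, `exact_unscaled` and `format_decimal` are hypothetical, and `fractions.Fraction` is used only as a stand-in for "computing without floating-point rounding". Non-negative scales are assumed.

```python
from fractions import Fraction


def naive_unscaled(x: float, scale: int) -> int:
    """Scale in the floating-point domain, then convert to an integer.

    The product is rounded to the nearest double, so digits beyond the
    ~17 significant digits of a double are rounding noise.
    """
    return int(x * float(10 ** scale))


def exact_unscaled(x: float, scale: int) -> int:
    """Scale with exact rational arithmetic (assumes scale >= 0)."""
    return round(Fraction(x) * 10 ** scale)


def format_decimal(unscaled: int, scale: int) -> str:
    """Render an unscaled integer with `scale` fractional digits (scale > 0)."""
    sign = "-" if unscaled < 0 else ""
    digits = str(abs(unscaled)).rjust(scale + 1, "0")
    return f"{sign}{digits[:-scale]}.{digits[-scale:]}"


for scale in (11, 12):
    print(scale, format_decimal(naive_unscaled(1234567890.0, scale), scale))
    print(scale, format_decimal(exact_unscaled(1234567890.0, scale), scale))
# 11 1234567889.99999995904
# 11 1234567890.00000000000
# 12 1234567890.000000057344
# 12 1234567890.000000000000
```

The naive variant reproduces the two outputs quoted above, because the scaled product is no longer exactly representable as a double once it exceeds 2**53.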

@pitrou force-pushed the gh-35576-float-to-decimal branch from 18155dc to f8a6443 on June 8, 2023 16:02
@pitrou marked this pull request as ready for review on June 8, 2023 19:52
@pitrou requested review from AlenkaF and westonpace as code owners on June 8, 2023 19:52

@pitrou (Member Author) commented Jun 8, 2023

@felipecrv @benibus Would either of you have time to review this?

@westonpace (Member) left a comment

I won't promise to follow the conversion algorithm entirely :) but this looks well tested and the bit shifting routines make sense.

Comment on lines -954 to -960
Member:

Is this going away because it is superseded by the version in GenericBasicDecimal?

Member Author:

Yes!

Member:

Why remove these cases? We can still convert negative numbers, right? Isn't it just less precise?

Member Author:

Because CheckDecimalFromReal and CheckDecimalFromRealIntegerString now automatically deduce the negative test cases from the positive ones.
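
The idea of deriving the negative case from the positive one can be sketched in Python as follows. This is not the Arrow C++ test helper; `from_real` and `check_from_real` are hypothetical names, and the converter is a simple exact-arithmetic stand-in that assumes a non-negative scale.

```python
from fractions import Fraction


def from_real(x: float, scale: int) -> int:
    """Hypothetical converter: unscaled decimal value for scale >= 0."""
    return round(Fraction(x) * 10 ** scale)


def check_from_real(real: float, scale: int, expected_unscaled: int) -> None:
    """Check one conversion and, for nonzero input, its mirrored negative."""
    assert from_real(real, scale) == expected_unscaled
    if real != 0:
        # The negative case is deduced rather than listed separately.
        assert from_real(-real, scale) == -expected_unscaled


# Only positive cases need to be written down.
check_from_real(0.0, 2, 0)
check_from_real(1.5, 1, 15)
check_from_real(1234567890.0, 11, 123456789000000000000)
```

With a helper of this shape, each positive entry in the test table covers two conversions, which is why explicit negative entries become redundant.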

Member:

Is this a TODO (should we create an issue?) or is it a fact, in which case we can get rid of the XXX?

Member Author:

Not sure. I wouldn't create an issue for it, as the benefit is probably low relative to the effort.

@github-actions bot added the "awaiting merge" label and removed the "awaiting review" label on Jun 9, 2023

@felipecrv (Contributor) left a comment

I can't claim I went through all the details of the implementation, but it looks like the kind of code that doesn't have rarely-taken branches that wouldn't be exercised from the unit tests. LGTM.

pitrou added 4 commits June 14, 2023 12:57
@pitrou force-pushed the gh-35576-float-to-decimal branch from 5f0abee to 87be93a on June 14, 2023 12:47

@pitrou (Member Author) commented Jun 14, 2023

I think I have addressed all review comments. Would you like to take another look? @westonpace @felipecrv

@pitrou merged commit b6eab1f into apache:main on Jun 14, 2023
@pitrou deleted the gh-35576-float-to-decimal branch on June 14, 2023 14:49
@conbench-apache-arrow

Conbench analyzed the 7 benchmark runs on commit b6eab1f4.

There were 35 benchmark results indicating a performance regression.

The full Conbench report has more details.

return {
// -- Stress the 24 bits of precision of a float
// 2**63 + 2**40
FromFloatTestParam{9.223373e+18f, 19, 0, "9223373136366403584"},

@huberylee (Contributor) commented Oct 31, 2023

@pitrou Hi, the expected return value of FromFloatTestParam{5.76460752e13f, 18, 4, "57646075230342.3488"} is 57646073774080.0000, which seems different from the original value. Is that expected?

Member Author:

What do you call "the original value" in that context?
Both values are actually equal in float32:

>>> np.float32(5.76460752e13) == np.float32(57646073774080)
True
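
The same check can be reproduced without NumPy. The sketch below is illustrative only; `to_float32` is a hypothetical helper that rounds a Python float (a double) to float32 through the standard `struct` module. It also checks the decimal string from the test parameter.

```python
import struct
from fractions import Fraction


def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


literal = to_float32(5.76460752e13)

# The value actually stored in the float32 is an exact integer.
assert Fraction(literal) == 57646073774080

# Both decimal numbers from the discussion round to that same float32.
assert to_float32(57646073774080.0) == literal
assert to_float32(57646075230342.3488) == literal

# The string in the test parameter is exactly 2**59 / 10**4.
assert Fraction("57646075230342.3488") * 10 ** 4 == 2 ** 59
```

At this magnitude adjacent float32 values are 2**22 = 4194304 apart, so decimal numbers that differ by about 1.4 million can still map to the same float32.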

Contributor:

> What do you call "the original value" in that context? Both values are actually equal in float32:
>
> >>> np.float32(5.76460752e13) == np.float32(57646073774080)
> True

I see what you mean. Thanks!

@AlenkaF removed their request for review on November 1, 2023 11:58
Development

Successfully merging this pull request may close these issues.

[C++] Decimal{128,256}::FromReal accuracy loss on non-small scale values

4 participants