viirya (Member) commented Apr 12, 2023

Which issue does this PR close?

Closes #5674.
Closes #3387.
Closes #4024.

Rationale for this change

Currently, decimal multiplication in DataFusion silently truncates the precision of the result. This happens even for regular decimal multiplication that doesn't overflow. It looks like DataFusion uses an incomplete version of Spark's decimal precision coercion rule to coerce the two sides of a decimal multiplication (and of other arithmetic operators). The type that both sides are coerced to is not the result type of the multiplication. This, together with how we compute decimal multiplication in the kernels, leads to truncated precision in the result decimal type.
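
The truncation can be illustrated with plain integer arithmetic. The following is a minimal sketch, not DataFusion's kernel code: it assumes decimals are stored as unscaled i128 values (as Arrow's Decimal128 does), and the helper names `mul_exact` and `rescale_down` are hypothetical.

```rust
/// Multiply two unscaled decimal values. The exact product carries
/// scale s1 + s2, so no digits are lost at this point.
fn mul_exact(a: i128, b: i128) -> i128 {
    a * b
}

/// Reduce an unscaled value from `from_scale` to a smaller `to_scale`
/// by integer division, which silently drops the low-order digits.
/// Assumes from_scale >= to_scale.
fn rescale_down(value: i128, from_scale: u32, to_scale: u32) -> i128 {
    value / 10i128.pow(from_scale - to_scale)
}

fn main() {
    // 1.23 and 4.56, both stored at scale 2.
    let (a, b) = (123_i128, 456_i128);

    // Exact product at scale 2 + 2 = 4, i.e. 5.6088.
    let exact = mul_exact(a, b);
    println!("exact product (scale 4): {}", exact); // 56088

    // Forcing the result back to scale 2 gives 5.60; the "88" is gone.
    let truncated = rescale_down(exact, 4, 2);
    println!("truncated product (scale 2): {}", truncated); // 560
}
```

No overflow is involved here: the digits are lost only because the result is held at a scale smaller than s1 + s2.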

What changes are included in this PR?

  • Moved decimal type coercion for math binary operators from TypeCoercion to the physical binary operator
  • Fixed type coercion rule for decimal
    • Produced correct coerced types
    • Separated result type from coerced type
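
To make the difference between a coerced type and a result type concrete, here is a hedged sketch of result-type rules in the style commonly documented for Spark and Hive. The function names are hypothetical, scales are assumed to be non-negative and no larger than the precision, and simply capping at 38 is a simplification; the rules implemented in this PR may differ in detail.

```rust
/// Largest precision a 128-bit decimal can hold.
const MAX_PRECISION: u8 = 38;

/// Result type of `decimal(p1, s1) * decimal(p2, s2)`:
/// precision p1 + p2 + 1, scale s1 + s2 (both capped at MAX_PRECISION).
fn multiply_result_type(p1: u8, s1: u8, p2: u8, s2: u8) -> (u8, u8) {
    ((p1 + p2 + 1).min(MAX_PRECISION), (s1 + s2).min(MAX_PRECISION))
}

/// Result type of `decimal(p1, s1) + decimal(p2, s2)`:
/// scale max(s1, s2), precision max(p1 - s1, p2 - s2) + scale + 1.
/// Assumes s1 <= p1 and s2 <= p2.
fn add_result_type(p1: u8, s1: u8, p2: u8, s2: u8) -> (u8, u8) {
    let scale = s1.max(s2);
    let integer_digits = (p1 - s1).max(p2 - s2);
    ((integer_digits + scale + 1).min(MAX_PRECISION), scale)
}

fn main() {
    // The inputs can stay decimal(10, 2); only the result type widens.
    println!("{:?}", multiply_result_type(10, 2, 10, 2)); // (21, 4)
    println!("{:?}", add_result_type(10, 2, 10, 2)); // (11, 2)
}
```

Under these rules the operands of a multiplication do not need to be cast to the result type first, which is why the two types are worth tracking separately.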

Are these changes tested?

Are there any user-facing changes?

@viirya viirya marked this pull request as draft April 12, 2023 20:01
@github-actions github-actions bot added the logical-expr (Logical plan and expressions), optimizer (Optimizer rules), and physical-expr (Changes to the physical-expr crates) labels Apr 12, 2023
viirya (Member Author) commented Apr 12, 2023

Unlike #5675, this doesn't add a new expression node, PromotePrecision. Instead, it defers decimal type coercion to the math expression evaluation phase. This approach is closer to how Spark handles decimal math coercion nowadays.

@github-actions github-actions bot added the core (Core DataFusion crate) and sqllogictest (SQL Logic Tests (.slt)) labels Apr 12, 2023
@viirya viirya force-pushed the fix_decimal_multiply_precision_loss4 branch 2 times, most recently from 54397f9 to 343ca79 Compare April 13, 2023 21:34
viirya (Member Author) commented Apr 16, 2023

There is a compilation error. I'm going to fix it in #6029.

@viirya viirya force-pushed the fix_decimal_multiply_precision_loss4 branch from cb7e326 to 0a88516 Compare April 17, 2023 01:18
Comment on lines +3313 to +3320
Some(99193548387), // 0.99193548387
None,
None,
Some(100813008130), // 1.0081300813
Some(100000000000), // 1.0
],
21,
11,
viirya (Member Author) commented

Previously, this division lost precision. Now we get it back.
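
As a rough illustration of why the result scale matters for division, here is a minimal sketch using plain i128 arithmetic rather than the Arrow kernels. The helper `div_at_scale` is hypothetical, and the operands 123 and 124 are assumptions chosen because they reproduce the digits in the expected values above.

```rust
/// Divide two integers and return the quotient as an unscaled decimal
/// with `scale` fractional digits. The dividend is scaled up before the
/// (truncating) integer division so the fractional digits survive.
fn div_at_scale(a: i128, b: i128, scale: u32) -> i128 {
    a * 10i128.pow(scale) / b
}

fn main() {
    // With scale 11 the quotients keep eleven fractional digits.
    println!("{}", div_at_scale(123, 124, 11)); // 99193548387  -> 0.99193548387
    println!("{}", div_at_scale(124, 123, 11)); // 100813008130 -> 1.0081300813
    println!("{}", div_at_scale(123, 123, 11)); // 100000000000 -> 1.0

    // With scale 2 the same division keeps only two fractional digits.
    println!("{}", div_at_scale(123, 124, 2)); // 99 -> 0.99
}
```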

// subtract: decimal array subtract int32 array
let schema = Arc::new(Schema::new(vec![
Field::new("b", DataType::Int32, true),
Field::new("a", DataType::Decimal128(10, 2), true),
viirya (Member Author) commented

Previously the field order was incorrect. But because we coerced the types on both sides of the op anyway, it still worked. Now we no longer coerce the decimal field (which was wrongly bound to an Int32Array) before it goes into the binary expression, so the wrong field order causes an error.

sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
sum(cast(l_extendedprice as decimal(12,2)) * (1 - l_discount) * (1 + l_tax)) as sum_charge,

Comment on lines +3 to +6
cast(cast(sum(case
when nation = 'BRAZIL' then volume
else 0
end) as decimal(12,2)) / cast(sum(volume) as decimal(12,2)) as decimal(15,2)) as mkt_share

pub fn i128_to_str(value: i128, precision: &u8, scale: &i8) -> String {
big_decimal_to_str(
BigDecimal::from_str(&Decimal::from_i128_with_scale(value, scale).to_string())
BigDecimal::from_str(&Decimal128Type::format_decimal(value, *precision, *scale))

@viirya viirya marked this pull request as ready for review April 17, 2023 21:35
viirya (Member Author) commented Apr 17, 2023

This fixes the decimal precision issue without the additional PromotePrecision node proposed in #5675.

cc @alamb @liukun4515

Dandandan (Contributor) commented

I wonder if this already fixes #4024

viirya (Member Author) commented Apr 18, 2023

> I wonder if this already fixes #4024

Yes, I just verified locally that this passes verify_q6.

Dandandan (Contributor) left a comment

Looks great!

viirya (Member Author) commented Apr 18, 2023

Thanks @Dandandan

Dandandan (Contributor) commented

Let's wait ~24hrs so other reviewers can have a chance.

Dandandan (Contributor) commented

FYI @mingmwang @andygrove: this PR also has some effect on performance, because the amount of casting changes (it is mostly reduced).

Dandandan (Contributor) commented

I ran the TPC-H benchmarks (SF=1) in memory.

Performance is mostly the same, except a ~30% improvement for q1 compared to main 🚀

@Dandandan Dandandan merged commit e81f54b into apache:main Apr 20, 2023
alamb (Contributor) commented Apr 24, 2023

🎉
