
perf: Optimize decimal precision check in decimal aggregates (sum and avg) #952

Merged: andygrove merged 22 commits into apache:main from andygrove:is_valid_decimal_precision on Sep 24, 2024

Conversation

@andygrove (Member) commented on Sep 18, 2024:

Which issue does this PR close?

Part of #951

Builds on #948

Rationale for this change

I noticed two areas of overhead in the current approach to verifying decimal precision in decimal aggregates sum and avg:

  1. In the event of an overflow, we build a formatted string for an error message, which is then discarded.
  2. The range check code introduces a memcpy that can be avoided.

I tested the following variations of the decimal precision check in the Rust playground.

```rust
// DECIMAL128_MAX_PRECISION and the MIN/MAX lookup tables are the
// per-precision bounds from arrow's decimal support. Precision is
// assumed >= 1; otherwise the index computation below would underflow.
fn validate_decimal_precision1(value: i128, precision: u8) -> bool {
    if precision > DECIMAL128_MAX_PRECISION {
        return false;
    }
    let idx = usize::from(precision) - 1;
    value >= MIN_DECIMAL_FOR_EACH_PRECISION[idx] && value <= MAX_DECIMAL_FOR_EACH_PRECISION[idx]
}
```

```rust
// based on arrow-rs version
fn validate_decimal_precision2(value: i128, precision: u8) -> bool {
    if precision > DECIMAL128_MAX_PRECISION {
        return false;
    }

    let max = MAX_DECIMAL_FOR_EACH_PRECISION[usize::from(precision) - 1];
    let min = MIN_DECIMAL_FOR_EACH_PRECISION[usize::from(precision) - 1];

    if value > max {
        false
    } else if value < min {
        false
    } else {
        true
    }
}
```

validate_decimal_precision1 avoids a memcpy that appears in validate_decimal_precision2:

```asm
playground::validate_decimal_precision1:
	subq	$1304, %rsp
	movb	%dl, %al
	movb	%al, 23(%rsp)
	movq	%rsi, 24(%rsp)
	movq	%rdi, 32(%rsp)
	movq	%rdi, 1264(%rsp)
	movq	%rsi, 1272(%rsp)
	movb	%al, 1287(%rsp)
	cmpb	$38, %al
	ja	.LBB9_2
	movb	23(%rsp), %al
	movb	%al, 1303(%rsp)
	movzbl	%al, %eax
	movq	%rax, %rcx
	subq	$1, %rcx
	movq	%rcx, 8(%rsp)
	cmpq	$1, %rax
	jb	.LBB9_4
	jmp	.LBB9_3
```

```asm
playground::validate_decimal_precision2:
	subq	$1368, %rsp
	movb	%dl, %al
	movb	%al, 55(%rsp)
	movq	%rsi, 56(%rsp)
	movq	%rdi, 64(%rsp)
	movq	%rdi, 1296(%rsp)
	movq	%rsi, 1304(%rsp)
	movb	%al, 1327(%rsp)
	cmpb	$38, %al
	ja	.LBB10_2
	leaq	80(%rsp), %rdi
	leaq	.L__unnamed_5(%rip), %rsi
	movl	$608, %edx
	callq	memcpy@PLT   <----- MEMCPY HERE
	movb	55(%rsp), %al
	movb	%al, 1367(%rsp)
	movzbl	%al, %eax
	movq	%rax, %rcx
	subq	$1, %rcx
	movq	%rcx, 40(%rsp)
	cmpq	$1, %rax
	jb	.LBB10_4
	jmp	.LBB10_3
```

What changes are included in this PR?

  • New precision check that returns bool instead of Err and avoids a memcpy
  • Added a Rust unit test for SumDecimal to make profiling easier
  • Minor code improvements

How are these changes tested?

  • New benchmark
  • Existing tests

@andygrove (Member, author) commented:

Benchmark results:

```
aggregate/avg_decimal_comet
                        time:   [905.95 µs 907.59 µs 909.32 µs]
                        change: [-4.4075% -4.0254% -3.6624%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

aggregate/sum_decimal_comet
                        time:   [781.78 µs 786.56 µs 791.40 µs]
                        change: [-15.686% -15.052% -14.420%] (p = 0.00 < 0.05)
                        Performance has improved.
```

@andygrove (Member, author) commented:

10 runs of TPC-H q1 @ 100 GB:

main branch:

```
"1": [
    13.25664234161377,
    11.419574975967407,
    11.131508111953735,
    11.158952474594116,
    11.132988929748535,
    11.101924657821655,
    11.077952146530151,
    11.080657720565796,
    11.011901378631592,
    11.053855419158936
]
```

this PR:

```
"1": [
    12.989612579345703,
    10.942750453948975,
    10.66578459739685,
    10.707809686660767,
    10.62846064567566,
    10.636594295501709,
    10.699990034103394,
    10.733628034591675,
    10.689185380935669,
    10.626504182815552
]
```

@andygrove (Member, author) commented:

```
aggregate/avg_decimal_datafusion
                        time:   [648.82 µs 651.35 µs 654.29 µs]
aggregate/avg_decimal_comet
                        time:   [882.83 µs 883.65 µs 884.48 µs]
aggregate/sum_decimal_datafusion
                        time:   [695.71 µs 696.20 µs 696.70 µs]
aggregate/sum_decimal_comet
                        time:   [784.45 µs 786.38 µs 788.42 µs]
```

@codecov-commenter commented on Sep 18, 2024:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 33.81%. Comparing base (cbaf1c2) to head (5381327).
Report is 5 commits behind head on main.

Additional details and impacted files
```
@@             Coverage Diff              @@
##               main     #952      +/-   ##
============================================
+ Coverage     33.80%   33.81%   +0.01%     
+ Complexity      852      851       -1     
============================================
  Files           112      112              
  Lines         43276    43286      +10     
  Branches       9572     9572              
============================================
+ Hits          14629    14639      +10     
  Misses        25634    25634              
  Partials       3013     3013              
```


@andygrove andygrove marked this pull request as ready for review September 19, 2024 14:18
@andygrove (Member, author) commented:

Comparison between main branch and this PR:

[charts omitted: tpch_allqueries, tpch_queries_speedup_rel]

@mbutrovich (Contributor) commented:

I was a bit surprised to see a performance win from changing an if-else to a compound boolean expression, since this seems like something that an optimizing compiler should handle well. I think I confirmed that part by putting the example code above in the Rust playground: in Release mode the memcpy doesn't exist, while in Debug mode I see the memcpy exactly as described above. That doesn't explain the real performance win in Comet, however.

My next guess was that hoisting the implementation from arrow-rs into Comet enabled better inlining opportunities. My ARM assembly reading is not as proficient as my x86, but I believe the relevant bits are below. This is the current main branch disassembly for comet::execution::datafusion::expressions::avg_decimal::AvgDecimalGroupsAccumulator::update_single:

```asm
bl arrow_data::decimal::validate_decimal_precision
```

There's a branch and link that isn't present in this PR's disassembly. The compiler is able to better inline the hoisted code.

I am not as familiar with Rust's build environment, so I'm not sure if this is expected when calling into code from other crates. I see Comet currently does thin LTO. I am curious whether full LTO would enable better inlining of functions like this, or if we're just at the limits of what the compiler can do. In general, this makes me curious about the performance limits of arrow-rs kernels in hot code paths, and may guide our future optimizations.
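The thin-vs-full LTO experiment mentioned above could be tried with a Cargo profile tweak along these lines (a sketch only, not taken from the Comet repository):

```toml
# Hypothetical Cargo.toml change to test whether full ("fat") LTO lets the
# compiler inline arrow-rs functions across crate boundaries. The discussion
# above says the current setting is thin LTO.
[profile.release]
lto = "fat"        # cross-crate link-time optimization; slower builds
codegen-units = 1  # fewer codegen units can further improve inlining
```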

@andygrove changed the title from "perf: Avoid memcpy during decimal precision check in decimal aggregates (sum and avg)" to "perf: Optimize decimal precision check in decimal aggregates (sum and avg)" on Sep 22, 2024
@kazuyukitanimura (Contributor) left a comment:

LGTM

```rust
}

fn is_nullable(&self) -> bool {
    // SumDecimal is always nullable because overflows can cause null values
```
@kazuyukitanimura (Contributor) commented on this line:
Wondering if this is true for ANSI mode. It looks like the previous code also hardcodes true, but this may be a good time to file an issue if there isn't one yet.
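As background on why overflow implies nullability in non-ANSI mode, here is a minimal sketch (illustrative only, not Comet's actual SumDecimal accumulator) of a checked decimal sum that yields None (a null result) instead of erroring when a group overflows:

```rust
// Illustrative sketch: in non-ANSI mode an overflowed group produces
// None (i.e. null) rather than raising an error. `max_for_precision`
// stands in for the bound looked up from the declared precision.
fn sum_checked(values: &[i128], max_for_precision: i128) -> Option<i128> {
    let mut acc: i128 = 0;
    for &v in values {
        // `?` propagates arithmetic overflow of i128 itself as None
        acc = acc.checked_add(v)?;
    }
    // A sum that exceeds the declared precision also becomes null
    if acc.unsigned_abs() > max_for_precision as u128 {
        None
    } else {
        Some(acc)
    }
}

fn main() {
    assert_eq!(sum_checked(&[1, 2, 3], 100), Some(6));
    assert_eq!(sum_checked(&[60, 50], 100), None); // exceeds precision
    assert_eq!(sum_checked(&[i128::MAX, 1], i128::MAX), None); // i128 overflow
}
```

In ANSI mode Spark raises an error on overflow instead, which is the behavior difference the comment above is asking about.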

@andygrove (Member, author) replied:

Good point. I filed #961

@andygrove (Member, author) commented:

> (quoting @mbutrovich's full comment above)

Thanks @mbutrovich. This is very insightful. I'd like to go ahead and merge this PR but would also like to have a better understanding of why this is actually faster.

@andygrove andygrove merged commit 50517f6 into apache:main Sep 24, 2024
@andygrove andygrove deleted the is_valid_decimal_precision branch September 24, 2024 14:35
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025
## Which issue does this PR close?

N/A

## Rationale for this change


Apply OSS 0.3.0 changes.

## What changes are included in this PR?


```
84cccf7 docs: Add notes for IntelliJ code size limits for code inspections. (apache#985)
dcc4a8a fix: The spilled_bytes metric of CometSortExec should be size instead of time (apache#984)
f64553b chore: fix compatibility guide (apache#978)
0ee7df8 chore: Enable additional CreateArray tests (apache#928)
a690e9d perf: Remove one redundant CopyExec for SMJ (apache#962)
a8156b5 chore: update rem expression guide (apache#976)
317a534 fix: Use the number of rows from underlying arrays instead of logical row count from RecordBatch (apache#972)
22561c4 doc: add documentation interlinks (apache#975)
b4de8e0 chore: Update benchmarks results based on 0.3.0-rc1 (apache#969)
94093f3 chore: fix publish-to-maven script (apache#966)
f31f6cc Generate changelog for 0.3.0 release (apache#964)
5663fc2 fix: div and rem by negative zero (apache#960)
50517f6 perf: Optimize decimal precision check in decimal aggregates (sum and avg) (apache#952)
5b3f7bc fix: CometScanExec on Spark 3.5.2 (apache#915)
8410c71 chore: clarify tarball installation (apache#959)
459b2b0 fix: window function range offset should be long instead of int (apache#733)
```


## How are these changes tested?

