Adaptive Row Block Size (#4812) #4818
Conversation
```diff
 for r in r2.iter() {
-    assert_eq!(r.data.len(), 34);
+    assert_eq!(r.data.len(), 10);
```
This is a nice demonstration of how this will reduce the impact of #4811. Whilst 10 is still worse than the 3 for dictionaries, it is a lot better than 34 😅
alamb left a comment
I haven't had a chance to read this PR in detail yet. In terms of the performance regression, is there some way to dynamically pick the number of blocks row by row (based on their contents), rather than hard coding it to always use a few small blocks up front? That way we could avoid the performance regression for larger strings
```diff
 /// with `0_u8` and written to the output, followed by the un-padded length in bytes
-/// of this final block as a `u8`.
+/// of this final block as a `u8`. The first 4 blocks have a length of 8, with subsequent
+/// blocks using a length of 32.
```
I think it would help to explain the rationale for using smaller blocks up front (to avoid space wastage for smaller strings)
I can't think of a scheme that would allow this whilst also allowing shorter strings to compare correctly against longer ones, e.g. "abc" < "bcde" < "cde"
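The ordering constraint above can be demonstrated with a minimal sketch of this style of block encoding. This is illustrative only, not the actual arrow-rs implementation: a single block size of 4 is used for brevity, and the continuation token value `0xFF` is an assumption (any value that compares greater than every possible length byte works).

```rust
// Illustrative block encoding sketch: fixed block size of 4 (assumption;
// the PR uses 8-byte mini-blocks followed by 32-byte blocks).
const SIZE: usize = 4;
// Continuation token (assumed value): must be greater than any length byte.
const CONTINUATION: u8 = 0xFF;

fn encode(val: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut rest = val;
    // Every full, non-final block is followed by a continuation token.
    while rest.len() > SIZE {
        out.extend_from_slice(&rest[..SIZE]);
        out.push(CONTINUATION);
        rest = &rest[SIZE..];
    }
    // Final block: zero-padded to SIZE, then its un-padded length as a u8.
    out.extend_from_slice(rest);
    out.resize(out.len() + (SIZE - rest.len()), 0);
    out.push(rest.len() as u8);
    out
}

fn main() {
    // Encoded bytes compare in the same order as the source strings:
    assert!(encode(b"abc") < encode(b"bcde"));
    assert!(encode(b"bcde") < encode(b"cde"));
    // Prefixes sort first: zero padding / the length byte lose to real data,
    // and the continuation token beats any length byte.
    assert!(encode(b"ab") < encode(b"abc"));
    assert!(encode(b"abcd") < encode(b"abcde"));
}
```

The key invariant is that the continuation token sorts after every possible final-block length byte, so a value always sorts after its proper prefixes. A per-row (content-dependent) choice of block layout would break this, because two values could then disagree on where the length and continuation bytes fall in the encoded form.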
alamb left a comment
Thank you @tustvold -- I went over this change carefully
Here is my summary of the expected benefits of this PR. I wonder if you agree with this assessment?
- **`GROUP BY` and `ORDER BY` with low cardinality multi-column string grouping / ordering with strings less than 32 characters**: I would expect smaller keys to make comparisons faster when the `Row` values are equal much of the time -- which is what happens for `GROUP BY` on low cardinality grouping operations in DataFusion. TPCH Q1 is an example (`return_status` is a string)
- **Memory usage when doing high cardinality multi-column grouping or sorting with strings less than 32 characters**: Since there is less memory overhead per value, the overall memory requirements are lower.
TLDR is I think this is the right tradeoff for IOx, but I can imagine workloads where one would rather have the faster conversion performance even at the cost of larger memory usage.
I wonder if anyone else in the community has an opinion?
```rust
}

impl ByteArrayNativeType for [u8] {
    #[inline]
```
`#[inline(always)]`? -- at some point I thought that saying 'always' was required to get cross-crate inlining to work
That hasn't been my experience. I believe `#[inline]` makes the data available to be inlined should LLVM think it a good idea, with `#[inline(always)]` only necessary when LLVM doesn't think it a good idea (for whatever reason).
```rust
/// The block size of the variable length encoding
pub const BLOCK_SIZE: usize = 32;

/// The first block is split into `MINI_BLOCK_COUNT` mini-blocks
```
Perhaps this would be a good place to add comments about the why for miniblocks.
```rust
Some(a) if a <= BLOCK_SIZE => {
    1 + ceil(a, MINI_BLOCK_SIZE) * (MINI_BLOCK_SIZE + 1)
}
Some(a) => MINI_BLOCK_COUNT + ceil(a, BLOCK_SIZE) * (BLOCK_SIZE + 1),
```
I don't understand why the padded length gets a MINI_BLOCK_COUNT term added to it. Is it because each mini-block carries a 1-byte continuation token?
Maybe a comment would help
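My reading is yes: each mini-block carries its own 1-byte token, and folding the leading non-null byte into that cost yields the `MINI_BLOCK_COUNT +` term. A sketch of the arithmetic (constant values are from this PR; the standalone function form and its name are illustrative):

```rust
const BLOCK_SIZE: usize = 32;
const MINI_BLOCK_COUNT: usize = 4;
const MINI_BLOCK_SIZE: usize = BLOCK_SIZE / MINI_BLOCK_COUNT; // 8

fn ceil(a: usize, b: usize) -> usize {
    (a + b - 1) / b
}

/// Encoded size of a non-null value of `a` bytes: 1 leading non-null byte,
/// plus (block + 1-byte token) per block.
fn padded_length(a: usize) -> usize {
    if a <= BLOCK_SIZE {
        // Only mini-blocks are needed.
        1 + ceil(a, MINI_BLOCK_SIZE) * (MINI_BLOCK_SIZE + 1)
    } else {
        // The full first block is written as MINI_BLOCK_COUNT mini-blocks,
        // each with its own token. Folding the leading byte in:
        //   1 + MINI_BLOCK_COUNT * (MINI_BLOCK_SIZE + 1)
        //     == MINI_BLOCK_COUNT + (BLOCK_SIZE + 1)
        // which is one "block's worth" plus the MINI_BLOCK_COUNT term below.
        MINI_BLOCK_COUNT + ceil(a, BLOCK_SIZE) * (BLOCK_SIZE + 1)
    }
}

fn main() {
    // A 3-byte string now costs 10 bytes (matching the updated test above),
    // versus 1 + ceil(3, 32) * 33 = 34 under the old single-block-size scheme.
    assert_eq!(padded_length(3), 10);
    assert_eq!(padded_length(32), 37); // four full mini-blocks
    assert_eq!(padded_length(33), 70); // first block as mini-blocks + one full block
}
```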
```rust
pub const BLOCK_SIZE: usize = 32;

/// The first block is split into `MINI_BLOCK_COUNT` mini-blocks
pub const MINI_BLOCK_COUNT: usize = 4;
```
I wonder if using a MINI_BLOCK count of 2 might get most of the memory savings but have lower CPU overhead 🤔
I didn't find this to yield a meaningful performance delta
```rust
/// Returns the number of bytes of encoded data
fn decoded_len(row: &[u8], options: SortOptions) -> usize {
```

```rust
#[inline]
fn encode_blocks<const SIZE: usize>(out: &mut [u8], val: &[u8]) -> usize {
```
perhaps we could add a comment here explaining what this writes (specifically continuation tokens vs BLOCK_CONTINUATION)
The only regression is converting back from rows to arrays, so for DF I would not expect this to regress performance, as this is only done for grouping, where other overheads will dominate
Which issue does this PR close?
Closes #4812
Rationale for this change
Reduces the space amplification for small strings, reducing memory usage and potentially yielding faster comparisons
This does noticeably regress the performance of converting rows back into arrow arrays, as a result of having to parse additional blocks. I am inclined to think this isn't a major concern, as I have only seen this functionality used following expensive, reducing operations like grouping, which will dominate execution time
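A sketch of where the extra parsing cost comes from: recovering a value's length now means walking up to `MINI_BLOCK_COUNT` extra per-mini-block tokens per value. This is illustrative only -- the constant values are from this PR, but the function shape and the `0xFF` token value are assumptions, and the leading null/non-null byte is assumed to have been skipped already.

```rust
// Assumed token value: must compare greater than any final-block length byte.
const BLOCK_CONTINUATION: u8 = 0xFF;
const MINI_BLOCK_SIZE: usize = 8;
const MINI_BLOCK_COUNT: usize = 4;
const BLOCK_SIZE: usize = 32;

/// Recover the un-padded payload length by walking the block structure.
/// `blocks` is the encoded row with the leading non-null byte stripped.
/// Smaller blocks up front mean more loop iterations per short value,
/// which is the source of the row -> array conversion regression.
fn decoded_len(blocks: &[u8]) -> usize {
    let mut pos = 0;
    let mut len = 0;
    let mut block = 0;
    loop {
        // The first MINI_BLOCK_COUNT blocks are mini-blocks; the rest are full.
        let size = if block < MINI_BLOCK_COUNT {
            MINI_BLOCK_SIZE
        } else {
            BLOCK_SIZE
        };
        let token = blocks[pos + size];
        if token == BLOCK_CONTINUATION {
            // Full block: all `size` bytes are payload, keep walking.
            len += size;
            pos += size + 1;
            block += 1;
        } else {
            // Final block: the token is its un-padded length.
            return len + token as usize;
        }
    }
}

fn main() {
    // 3-byte payload: one zero-padded mini-block, token = 3.
    assert_eq!(decoded_len(&[97, 98, 99, 0, 0, 0, 0, 0, 3]), 3);
    // 10-byte payload: one full mini-block + continuation, then a final block.
    assert_eq!(
        decoded_len(&[1, 2, 3, 4, 5, 6, 7, 8, 0xFF, 9, 10, 0, 0, 0, 0, 0, 0, 2]),
        10
    );
}
```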
What changes are included in this PR?
Are there any user-facing changes?