Adaptive Row Block Size (#4812) #4818
Conversation
```diff
 for r in r2.iter() {
-    assert_eq!(r.data.len(), 34);
+    assert_eq!(r.data.len(), 10);
```
This is a nice demonstration of how this will reduce the impact of #4811. Whilst 10 is still worse than the 3 for dictionaries, it is a lot better than 34 😅
alamb left a comment
I haven't had a chance to read this PR in detail yet. In terms of the performance regression, is there some way to dynamically pick the number of blocks row by row (based on their contents), rather than hard coding it to always use a few small blocks up front? That way we could avoid the performance regression for larger strings
```diff
 /// with `0_u8` and written to the output, followed by the un-padded length in bytes
-/// of this final block as a `u8`.
+/// of this final block as a `u8`. The first 4 blocks have a length of 8, with subsequent
+/// blocks using a length of 32.
```
I think it would help to explain the rationale for using smaller blocks up front (to avoid space wastage for smaller strings)
I can't think of a scheme that would allow this whilst also allowing shorter strings to compare correctly against longer ones, e.g. "abc" < "bcde" < "cde"
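The ordering constraint above can be demonstrated with a minimal sketch of this style of block encoding. This is illustrative only, not the actual arrow-rs implementation: a single block size of 4 is used for brevity, and the continuation token value `0xFF` is an assumption (any value that compares greater than every possible length byte works).

```rust
// Illustrative block encoding sketch: fixed block size of 4 (assumption;
// the PR uses 8-byte mini-blocks followed by 32-byte blocks).
const SIZE: usize = 4;
// Continuation token (assumed value): must be greater than any length byte.
const CONTINUATION: u8 = 0xFF;

fn encode(val: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut rest = val;
    // Every full, non-final block is followed by a continuation token.
    while rest.len() > SIZE {
        out.extend_from_slice(&rest[..SIZE]);
        out.push(CONTINUATION);
        rest = &rest[SIZE..];
    }
    // Final block: zero-padded to SIZE, then its un-padded length as a u8.
    out.extend_from_slice(rest);
    out.resize(out.len() + (SIZE - rest.len()), 0);
    out.push(rest.len() as u8);
    out
}

fn main() {
    // Encoded bytes compare in the same order as the source strings:
    assert!(encode(b"abc") < encode(b"bcde"));
    assert!(encode(b"bcde") < encode(b"cde"));
    // Prefixes sort first: zero padding / the length byte lose to real data,
    // and the continuation token beats any length byte.
    assert!(encode(b"ab") < encode(b"abc"));
    assert!(encode(b"abcd") < encode(b"abcde"));
}
```

The key invariant is that the continuation token sorts after every possible final-block length byte, so a value always sorts after its proper prefixes. A per-row (content-dependent) choice of block layout would break this, because two values could then disagree on where the length and continuation bytes fall in the encoded form.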
alamb left a comment
Thank you @tustvold -- I went over this change carefully
Here is my summary of the expected benefits of this PR. I wonder if you agree with this assessment?
- **`GROUP BY` and `ORDER BY` with low cardinality multi-column string grouping / ordering with strings less than 32 characters**: I would expect smaller keys to make comparisons faster when the `Row` values are equal much of the time -- which is what happens for `GROUP BY` on low cardinality grouping operations in DataFusion. TPCH Q1 is an example (`return_status` is a string)
- **Memory usage when doing high cardinality multi-column grouping or sorting with strings less than 32 characters**: Since there is less memory overhead per value, the overall memory requirements are lower.
TLDR is I think this is the right tradeoff for IOx, but I can imagine workloads where one would rather have the faster conversion performance even at the cost of larger memory usage.
I wonder if anyone else in the community has an opinion?
```rust
}

impl ByteArrayNativeType for [u8] {
    #[inline]
```
`#[inline(always)]`? -- at some point I thought that saying 'always' was required to get cross-crate inlining to work
That hasn't been my experience. I believe `#[inline]` makes the data available to be inlined should LLVM think it a good idea, with `#[inline(always)]` only necessary when LLVM doesn't think it a good idea (for whatever reason).
```rust
/// The block size of the variable length encoding
pub const BLOCK_SIZE: usize = 32;

/// The first block is split into `MINI_BLOCK_COUNT` mini-blocks
```
Perhaps this would be a good place to add comments about the why for miniblocks.
```rust
Some(a) if a <= BLOCK_SIZE => {
    1 + ceil(a, MINI_BLOCK_SIZE) * (MINI_BLOCK_SIZE + 1)
}
Some(a) => MINI_BLOCK_COUNT + ceil(a, BLOCK_SIZE) * (BLOCK_SIZE + 1),
```
I don't understand why the padded length gets a MINI_BLOCK_COUNT term added to it. Is it because each mini-block carries a 1-byte continuation token?
Maybe a comment would help
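My reading is yes: each mini-block carries its own 1-byte token, and folding the leading non-null byte into that cost yields the `MINI_BLOCK_COUNT +` term. A sketch of the arithmetic (constant values are from this PR; the standalone function form and its name are illustrative):

```rust
const BLOCK_SIZE: usize = 32;
const MINI_BLOCK_COUNT: usize = 4;
const MINI_BLOCK_SIZE: usize = BLOCK_SIZE / MINI_BLOCK_COUNT; // 8

fn ceil(a: usize, b: usize) -> usize {
    (a + b - 1) / b
}

/// Encoded size of a non-null value of `a` bytes: 1 leading non-null byte,
/// plus (block + 1-byte token) per block.
fn padded_length(a: usize) -> usize {
    if a <= BLOCK_SIZE {
        // Only mini-blocks are needed.
        1 + ceil(a, MINI_BLOCK_SIZE) * (MINI_BLOCK_SIZE + 1)
    } else {
        // The full first block is written as MINI_BLOCK_COUNT mini-blocks,
        // each with its own token. Folding the leading byte in:
        //   1 + MINI_BLOCK_COUNT * (MINI_BLOCK_SIZE + 1)
        //     == MINI_BLOCK_COUNT + (BLOCK_SIZE + 1)
        // which is one "block's worth" plus the MINI_BLOCK_COUNT term below.
        MINI_BLOCK_COUNT + ceil(a, BLOCK_SIZE) * (BLOCK_SIZE + 1)
    }
}

fn main() {
    // A 3-byte string now costs 10 bytes (matching the updated test above),
    // versus 1 + ceil(3, 32) * 33 = 34 under the old single-block-size scheme.
    assert_eq!(padded_length(3), 10);
    assert_eq!(padded_length(32), 37); // four full mini-blocks
    assert_eq!(padded_length(33), 70); // first block as mini-blocks + one full block
}
```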
```rust
pub const BLOCK_SIZE: usize = 32;

/// The first block is split into `MINI_BLOCK_COUNT` mini-blocks
pub const MINI_BLOCK_COUNT: usize = 4;
```
I wonder if using a MINI_BLOCK count of 2 might get most of the memory savings but have lower CPU overhead 🤔
I didn't find this to yield a meaningful performance delta
```rust
/// Returns the number of bytes of encoded data
fn decoded_len(row: &[u8], options: SortOptions) -> usize {
```

```rust
#[inline]
fn encode_blocks<const SIZE: usize>(out: &mut [u8], val: &[u8]) -> usize {
```
perhaps we could add a comment here explaining what this writes (specifically continuation tokens vs BLOCK_CONTINUATION)
The only regression is converting back from rows to arrays, so for DF I would not expect this to regress performance, as this is only done for grouping, where other overheads will dominate
Which issue does this PR close?
Closes #4812
Rationale for this change
Reduces the space amplification for small strings, reducing memory usage and potentially yielding faster comparisons
This does noticeably regress the performance of converting rows back into arrow arrays, as a result of having to parse additional blocks. I am inclined to think this isn't a major concern, as I have only seen this functionality used following expensive, reducing operations like grouping, which will dominate execution time
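A sketch of where the extra parsing cost comes from: recovering a value's length now means walking up to `MINI_BLOCK_COUNT` extra per-mini-block tokens per value. This is illustrative only -- the constant values are from this PR, but the function shape and the `0xFF` token value are assumptions, and the leading null/non-null byte is assumed to have been skipped already.

```rust
// Assumed token value: must compare greater than any final-block length byte.
const BLOCK_CONTINUATION: u8 = 0xFF;
const MINI_BLOCK_SIZE: usize = 8;
const MINI_BLOCK_COUNT: usize = 4;
const BLOCK_SIZE: usize = 32;

/// Recover the un-padded payload length by walking the block structure.
/// `blocks` is the encoded row with the leading non-null byte stripped.
/// Smaller blocks up front mean more loop iterations per short value,
/// which is the source of the row -> array conversion regression.
fn decoded_len(blocks: &[u8]) -> usize {
    let mut pos = 0;
    let mut len = 0;
    let mut block = 0;
    loop {
        // The first MINI_BLOCK_COUNT blocks are mini-blocks; the rest are full.
        let size = if block < MINI_BLOCK_COUNT {
            MINI_BLOCK_SIZE
        } else {
            BLOCK_SIZE
        };
        let token = blocks[pos + size];
        if token == BLOCK_CONTINUATION {
            // Full block: all `size` bytes are payload, keep walking.
            len += size;
            pos += size + 1;
            block += 1;
        } else {
            // Final block: the token is its un-padded length.
            return len + token as usize;
        }
    }
}

fn main() {
    // 3-byte payload: one zero-padded mini-block, token = 3.
    assert_eq!(decoded_len(&[97, 98, 99, 0, 0, 0, 0, 0, 3]), 3);
    // 10-byte payload: one full mini-block + continuation, then a final block.
    assert_eq!(
        decoded_len(&[1, 2, 3, 4, 5, 6, 7, 8, 0xFF, 9, 10, 0, 0, 0, 0, 0, 0, 2]),
        10
    );
}
```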
What changes are included in this PR?
Are there any user-facing changes?