perf: add a chunk cache to avoid decoding duplicated miniblock chunks#4846
westonpace merged 1 commit into lance-format:main
Conversation
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
```rust
// Now we iterate through each instruction and process it
for (instructions, chunk) in self.instructions.iter() {
    // TODO: It's very possible that we have duplicate `buf` in self.instructions and we
    // don't want to decode the buf again and again on the same thread.
```
This PR partially addresses this TODO. It improves performance unless chunks are accessed in a fully random pattern, which would require a HashMap-based cache at the cost of higher memory usage.
Chunks should always be accessed sequentially I believe. We have a requirement at some point in the decoding process for offsets / ranges to be in sorted order.
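The single-entry cache idea discussed in this thread can be sketched as follows. This is a minimal illustration, not Lance's actual code: `DecodedChunk`, `decode_chunk`, and `ChunkCache` are hypothetical stand-ins, and `decode_calls` exists only to show how many decodes are avoided.

```rust
// Minimal sketch of a single-entry decoded-chunk cache, assuming chunks are
// identified by an index and decoding is deterministic. All names here are
// hypothetical stand-ins, not Lance's real types.
#[derive(Clone, Debug, PartialEq)]
struct DecodedChunk {
    values: Vec<u8>,
}

fn decode_chunk(chunk_idx: usize) -> DecodedChunk {
    // Placeholder for the expensive miniblock decode (decompression etc.).
    DecodedChunk { values: vec![chunk_idx as u8; 4] }
}

struct ChunkCache {
    last: Option<(usize, DecodedChunk)>, // (chunk index, decoded data)
    decode_calls: usize,                 // counts actual decodes, for illustration
}

impl ChunkCache {
    fn new() -> Self {
        Self { last: None, decode_calls: 0 }
    }

    /// Return the decoded chunk, reusing the previous result when the same
    /// chunk is requested again (the common case for sorted row indices).
    fn get(&mut self, chunk_idx: usize) -> &DecodedChunk {
        let hit = matches!(&self.last, Some((idx, _)) if *idx == chunk_idx);
        if !hit {
            self.decode_calls += 1;
            self.last = Some((chunk_idx, decode_chunk(chunk_idx)));
        }
        &self.last.as_ref().unwrap().1
    }
}

fn main() {
    let mut cache = ChunkCache::new();
    // Sorted, clustered indices map to chunk indices with many repeats.
    for chunk_idx in [0, 0, 0, 1, 1, 2, 2, 2, 2, 3] {
        let _ = cache.get(chunk_idx);
    }
    // 10 accesses, but only 4 runs of distinct chunks => 4 decodes.
    println!("decodes: {}", cache.decode_calls);
}
```

Because only the most recent chunk is held, memory overhead stays at one decoded chunk regardless of how many chunks the page contains; the trade-off is that a fully random access order would get no hits, which is where the HashMap alternative mentioned above would come in.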
Codecov Report ❌ Patch coverage is

Additional details and impacted files:

```
@@ Coverage Diff @@
##             main    #4846      +/-   ##
==========================================
- Coverage   81.67%   81.67%   -0.01%
==========================================
  Files         334      334
  Lines      132492   132508      +16
==========================================
+ Hits       108215   108227      +12
- Misses      20640    20645       +5
+ Partials     3637     3636       -1
```
Force-pushed 611a69c to b7620b1
Force-pushed b7620b1 to 8632a91
This PR is ready for review. The CI still reports one test failure on
I confirmed that the test case
8632a91 to
24f9d3f
Compare
I rebased onto the latest main branch and encountered another flaky test
@westonpace could you please help review this PR when you have a moment? I believe it partially addresses one of the TODO comments in the code you previously wrote for this part.
Hi @westonpace, just wanted to gently check in to see if you've had a chance to take a look at this PR.
westonpace left a comment:
Oh, very nice. Do you have a test case / benchmark of any kind that you've been running to verify performance? No need to get it in for this PR but in a future PR it might be nice to add some kind of benchmark like that to help prevent regressions. I guess it would be a benchmark that is reading every other row (or a bunch of rows in the same page) or something like that.
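The access pattern suggested above (reading every other row in a page) can be sketched without any Lance APIs: with sorted indices, consecutive rows land in the same chunk, so a single-entry cache turns most lookups into hits. `ROWS_PER_CHUNK` below is an illustrative value, not Lance's actual miniblock size, and `decodes_needed` is a hypothetical helper.

```rust
// Illustrative chunk size; Lance's real miniblock sizing differs.
const ROWS_PER_CHUNK: usize = 128;

/// Count how many times the chunk changes between consecutive accesses;
/// with a single-entry cache this equals the number of decodes performed.
fn decodes_needed(row_indices: &[usize]) -> usize {
    let mut decodes = 0;
    let mut last_chunk = None;
    for &row in row_indices {
        let chunk = row / ROWS_PER_CHUNK;
        if last_chunk != Some(chunk) {
            decodes += 1;
            last_chunk = Some(chunk);
        }
    }
    decodes
}

fn main() {
    // Every other row out of 10_000 rows: 5_000 accesses in sorted order.
    let every_other: Vec<usize> = (0..10_000).step_by(2).collect();
    let without_cache = every_other.len(); // one decode per access
    let with_cache = decodes_needed(&every_other);
    println!("without cache: {without_cache}, with cache: {with_cache}");
}
```

Under these assumptions the cached version decodes each of the 79 touched chunks once instead of once per row, which is the kind of gap a regression benchmark could track.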
…us chunks don't have to be decoded multiple times.
Force-pushed 24f9d3f to 3ef8e50
Rebased and will merge on green
Sorry, this took way longer to get to than it should have.
I tested it within my application, as described in the PR description. I can try to add a benchmark within Lance itself to verify this improvement more systematically later. Thanks.
…lance-format#4846)

# Description

When miniblock encoding is used in a Lance file, reading the file with the v2 `FileReader` via the `read_stream_projected` API can become inefficient if the provided `ReadBatchParams::Indices` contains many nearby but non-contiguous row indices. For example:

```
29, 168, 180, 194, 376, 559, 574, 665, 666, 667, ..., 968, 969, 970, 973, 975, ...
```

This kind of access pattern causes the same chunk to be decoded repeatedly, resulting in slow performance and high CPU usage.

# Solution

This PR introduces a lightweight single-entry cache in `DecodePageTask`. While it only helps when chunks are accessed in a somewhat sequential manner, row indices are typically sorted in ascending order, so the cache strikes a balance between saving memory and improving performance.

# Test

On a local setup with a Lance file containing 100k rows (each row with a text column of 200+ bytes):

* Reading 1700+ nearby but non-contiguous rows at random
* `zstd` is used for general compression
* With this change, performance improved by 3x–5x, depending on the dataset.
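For fully random access patterns, the review discussion notes that a HashMap-based cache would be needed at the cost of higher memory usage. A minimal sketch of that alternative, with hypothetical names (`MapCache`, `decode_chunk`) that are not Lance code:

```rust
use std::collections::HashMap;

// Sketch of the HashMap-based alternative: it also avoids re-decoding under
// fully random chunk access, but it keeps every decoded chunk alive, so
// memory grows with the number of distinct chunks touched.
fn decode_chunk(chunk_id: usize) -> Vec<u8> {
    // Stand-in for the expensive miniblock decode.
    vec![chunk_id as u8; 4]
}

struct MapCache {
    map: HashMap<usize, Vec<u8>>,
    decode_calls: usize, // counts actual decodes, for illustration
}

impl MapCache {
    fn new() -> Self {
        Self { map: HashMap::new(), decode_calls: 0 }
    }

    /// Decode on first access, then serve every later access from the map,
    /// regardless of access order.
    fn get(&mut self, chunk_id: usize) -> &Vec<u8> {
        if !self.map.contains_key(&chunk_id) {
            self.decode_calls += 1;
            self.map.insert(chunk_id, decode_chunk(chunk_id));
        }
        self.map.get(&chunk_id).unwrap()
    }
}

fn main() {
    let mut cache = MapCache::new();
    // Non-sequential order: a single-entry cache would miss on the revisits,
    // but the map still hits after each chunk's first decode.
    for id in [3, 0, 3, 7, 0, 3] {
        let _ = cache.get(id);
    }
    println!(
        "decodes: {}, chunks held in memory: {}",
        cache.decode_calls,
        cache.map.len()
    );
}
```

This is why the single-entry cache is the better fit for this PR's target workload: sorted indices make revisits adjacent, so one entry captures most hits without retaining every decoded chunk.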