
perf: add a chunk cache to avoid decoding duplicated miniblock chunks#4846

Merged
westonpace merged 1 commit intolance-format:mainfrom
niyue:feature/chunk-cache
Nov 4, 2025

Conversation

@niyue
Contributor

@niyue niyue commented Sep 30, 2025

Description

When miniblock encoding is used in a Lance file, reading the file with the v2 FileReader via the read_stream_projected API can become inefficient if the provided ReadBatchParams::Indices contains many nearby but non-contiguous row indices.
For example:

```
29, 168, 180, 194, 376, 559, 574, 665, 666, 667, ..., 968, 969, 970, 973, 975, ...
```

This kind of access pattern causes the same chunk to be decoded repeatedly, resulting in slow performance and high CPU usage.

Solution

This PR introduces a lightweight single-entry cache in DecodePageTask. While it only helps when chunks are accessed in a somewhat sequential manner, row indices are typically sorted in ascending order, so the cache strikes a balance between saving memory and improving performance.

Test

On a local setup with a Lance file containing 100k rows (each row with a text column of 200+ bytes):

  • Reading 1700+ nearby but non-contiguous rows at random
  • zstd is used for general compression
  • With this change, performance improved by 3x–5x, depending on the dataset.

@github-actions
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

```rust
// Now we iterate through each instruction and process it
for (instructions, chunk) in self.instructions.iter() {
    // TODO: It's very possible that we have duplicate `buf` in self.instructions and we
    // don't want to decode the buf again and again on the same thread.
```
Contributor Author


This PR partially addresses this TODO. It improves performance unless chunks are accessed in a fully random pattern, which would require a HashMap-based cache at the cost of higher memory usage.

Member


Chunks should always be accessed sequentially I believe. We have a requirement at some point in the decoding process for offsets / ranges to be in sorted order.

@niyue niyue changed the title Add a chunk cache to avoid decoding duplicated miniblock chunks perf: add a chunk cache to avoid decoding duplicated miniblock chunks Sep 30, 2025
@codecov-commenter

codecov-commenter commented Sep 30, 2025

Codecov Report

❌ Patch coverage is 94.73684% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 81.67%. Comparing base (7e65e8b) to head (24f9d3f).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../lance-encoding/src/encodings/logical/primitive.rs | 94.73% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #4846      +/-   ##
==========================================
- Coverage   81.67%   81.67%   -0.01%
==========================================
  Files         334      334
  Lines      132492   132508      +16
  Branches   132492   132508      +16
==========================================
+ Hits       108215   108227      +12
- Misses      20640    20645       +5
+ Partials     3637     3636       -1
```
| Flag | Coverage Δ |
|---|---|
| unittests | 81.67% <94.73%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@niyue niyue force-pushed the feature/chunk-cache branch 2 times, most recently from 611a69c to b7620b1 on September 30, 2025 09:03
@niyue niyue force-pushed the feature/chunk-cache branch from b7620b1 to 8632a91 on October 9, 2025 02:16
@niyue
Contributor Author

niyue commented Oct 9, 2025

This PR is ready for review.

The CI still reports one test failure on mac-build (index::vector::ivf::v2::tests::test_build_ivf_pq_4bit::case_3). I tried rebasing onto the latest main branch, and it still fails, but I'm unable to reproduce it locally on my MacBook. I'm not very familiar with the failing test; it passed in one of my previous pushes, and it doesn't appear to touch the code paths I modified at all. Please let me know if you think it's related, and I'll be happy to investigate further. Thanks!

@niyue
Contributor Author

niyue commented Oct 13, 2025

I confirmed that the test case index::vector::ivf::v2::tests::test_build_ivf_pq_4bit::case_3 is flaky — when running it on the main branch (commit 7e65e8b0) on my MacBook locally, 4 out of 50 runs failed.

@niyue
Contributor Author

niyue commented Oct 13, 2025

I rebased onto the latest main branch and encountered another flaky test dataset::optimize::tests::test_read_btree_index_with_defer_index_remap (as reported here). I’ll leave the code unchanged for now.

@niyue
Contributor Author

niyue commented Oct 22, 2025

@westonpace could you please help review this PR when you have a moment? I believe it partially addresses a TODO comment you previously left in this part of the code.

@niyue
Contributor Author

niyue commented Oct 30, 2025

Hi @westonpace, just wanted to gently check in to see if you’ve had a chance to take a look at this PR.
No rush at all — just making sure it didn’t slip through the cracks 😊

Member

@westonpace westonpace left a comment


Oh, very nice. Do you have a test case / benchmark of any kind that you've been running to verify performance? No need to get it in for this PR but in a future PR it might be nice to add some kind of benchmark like that to help prevent regressions. I guess it would be a benchmark that is reading every other row (or a bunch of rows in the same page) or something like that.


…us chunks don't have to be decoded multiple times.
@westonpace
Member

Rebased and will merge on green

@westonpace
Member

> Hi @westonpace, just wanted to gently check in to see if you’ve had a chance to take a look at this PR.
> No rush at all — just making sure it didn’t slip through the cracks 😊

Sorry, this took way longer to get to than it should have.

@westonpace westonpace merged commit b229e47 into lance-format:main Nov 4, 2025
26 of 27 checks passed
@niyue
Contributor Author

niyue commented Nov 4, 2025

> Do you have a test case / benchmark of any kind that you've been running to verify performance?

I tested it within my application, as described in the Test section of this PR description. The workload involves a sequential but non-contiguous row access pattern, retrieving about 2% of the data. With this enhancement, I observed a 3x–5x performance improvement. I expect block-based compression schemes such as Zstd and LZ4 will show similar performance improvements.

I can try to add a benchmark within Lance itself to verify this improvement more systematically later. Thanks.

jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
…lance-format#4846)
