perf: speed up format v2.2 scans by adding shortcut for full page#5981
Conversation
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
PR Review: perf: refine FullZip repetition index scheduling and caching

Overall: Clean refactoring that simplifies the API and improves I/O scheduling. No P0/P1 issues found.

Summary of Changes
Minor Observations
LGTM 👍
westonpace
left a comment
Can we at least leave in the option to disable it?
> Additionally, the rep index cache grows linearly; for 200k rows it occupies about 1.6 MiB. This cache will be managed by our global metadata cache. So I think it's totally ok for us to handle it.
I am a little bit worried about this. If we have billions of rows in a dataset won't that mean we need GBs of RAM (per string/binary column)?
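As a back-of-envelope check of this concern, extrapolating from the ~1.6 MiB per 200k rows figure quoted above (the arithmetic below is illustrative, not from the PR itself):

```rust
fn main() {
    // Linear growth reported in the PR: ~1.6 MiB of rep index per 200k rows.
    let bytes_per_row = 1.6 * 1024.0 * 1024.0 / 200_000.0; // ≈ 8.4 bytes/row
    // Extrapolated to a one-billion-row string/binary column:
    let gib_per_billion = bytes_per_row * 1e9 / (1024.0 * 1024.0 * 1024.0);
    // Roughly 8 GiB of rep index cache per such column, matching the worry
    // that billions of rows would mean GBs of RAM.
    println!(
        "{:.1} bytes/row, {:.1} GiB per billion rows",
        bytes_per_row, gib_per_billion
    );
}
```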
If the goal is to improve full scan performance I think there is potentially another way. If we know we are going to load an entire page then we could shortcut and just read the entire page. Then schedule_ranges could receive a special reader that just returns slices of the page data. This way we avoid needing two stages of I/O for full scans.
Yep, will do.
Seems like an interesting idea; I'll give it a try first.
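The full-page shortcut suggested above could look roughly like this: read the whole page with one I/O, then serve every subsequent range request as a slice of that buffer. This is a simplified sketch, not the actual Lance implementation; the struct name follows the later discussion, but the methods are illustrative.

```rust
use std::sync::Arc;

/// Sketch of the "full page shortcut": when a scan covers an entire page,
/// fetch the page bytes once, then answer range requests by slicing the
/// buffer instead of issuing a second round of I/O.
struct FullZipReadSource {
    /// The entire page, fetched with a single read.
    page: Arc<Vec<u8>>,
    /// Offset of the page within the file, so absolute file ranges can be
    /// translated into buffer-relative ranges.
    page_offset: u64,
}

impl FullZipReadSource {
    fn new(page: Vec<u8>, page_offset: u64) -> Self {
        Self {
            page: Arc::new(page),
            page_offset,
        }
    }

    /// Return the bytes for an absolute file range `[start, end)` by
    /// slicing the cached page buffer; no additional I/O is needed.
    fn read_range(&self, start: u64, end: u64) -> &[u8] {
        let s = (start - self.page_offset) as usize;
        let e = (end - self.page_offset) as usize;
        &self.page[s..e]
    }
}

fn main() {
    // A fake 16-byte "page" starting at file offset 100.
    let source = FullZipReadSource::new((0u8..16).collect(), 100);
    // Requesting file range [104, 108) returns bytes 4..8 of the page.
    assert_eq!(source.read_range(104, 108), &[4, 5, 6, 7]);
}
```

The point of the design is that `schedule_ranges` can keep issuing the same range requests it always has; only the backing reader changes, so full scans collapse from two I/O stages (rep index, then values) into one.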
Hi @westonpace, I added the shortcut as suggested. The latest bench results:
Compared to previous impls:
A bit slower on local, but I think that's fine.
westonpace
left a comment
I like the shortcut with FullZipReadSource, that looks great. Do we need to get rid of the caching ability though? I think it still might be useful for random access cases.
```rust
/// Cached state containing the decoded repetition index
cached_state: Option<Arc<FullZipCacheableState>>,
/// Whether to enable caching of repetition indices
enable_cache: bool,
```
Why get rid of this? It might still be useful for users that want 1 IOP random access on relatively small amounts of data?
I tried always enabling the cache before; now that we have the shortcut, we can bring the flag back.
…rep-index # Conflicts: # rust/lance-encoding/src/encodings/logical/primitive.rs
…rep-index # Conflicts: # rust/lance-encoding/src/encodings/logical/primitive.rs
westonpace
left a comment
Awesome, thanks for bearing with the review! Also, I'm really happy to see this fix get in, I always felt it was kind of embarrassing we were doing two IOPS in this full page case 😰. I just never got around to fixing it 😆
…nce-format#5981)

This PR addresses a long-standing issue in the Lance file format (both v2.1 and v2.2) where the rep index must be loaded before reading any full-zipped values. This can cause serious head-of-line (HoL) blocking, especially when data is stored on high-latency, high-throughput services like S3. @westonpace previously reported this issue in lance-format#3579.

We already support caching this index on user request (a feature I implemented earlier). Now I believe we should always cache it by default, as it is low-cost and highly beneficial.

Based on a full scan test using the FineWeb dataset, I observed the following improvements:

**Local dataset**
- p50: 900,363 µs → 542,266 µs (1.66x faster, −39.8%)
- p95: 951,820 µs → 579,505 µs (1.64x faster)
- p99: 994,691 µs → 713,062 µs (1.40x faster)

**S3 dataset**
- p50: 3,981,524 µs → 990,825 µs (4.02x faster, −75.1%)
- p95: 4,056,506 µs → 1,124,499 µs (3.61x faster)
- p99: 4,106,640 µs → 1,207,027 µs (3.40x faster)

Additionally, the rep index cache grows linearly; for 200k rows it occupies about 1.6 MiB. Since this cache is managed by our global metadata cache, I think it's totally ok for us to handle it.

This PR includes the following changes:

- always cache the repetition index when present and populate `cached_state` immediately
- split I/O submission so cached paths submit reads before awaiting, keeping the non-cached behavior as a fallback
- drop unused cache flag/parameter plumbing and update the full zip cache test expectations

**Parts of this PR were drafted with assistance from Codex (with `gpt-5.3-codex`) and amp (with `claude-4.6`), and fully reviewed and edited by me. I take full responsibility for all changes.**
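The "always cache the repetition index" change above can be sketched, very loosely, as a get-or-decode lookup against a shared metadata cache, so decoding happens at most once per page and later scans reuse the same `Arc`. All names here (`RepIndex`, `MetadataCache`, `get_or_decode`, and the decode step itself) are illustrative, not Lance's actual API.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for the decoded repetition index of one page.
#[derive(Debug, PartialEq)]
struct RepIndex {
    row_offsets: Vec<u64>,
}

/// Stand-in for the global metadata cache that owns decoded indices.
struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<RepIndex>>>,
}

impl MetadataCache {
    fn new() -> Self {
        Self {
            entries: Mutex::new(HashMap::new()),
        }
    }

    /// Return the cached index for `key`, decoding it on first access only.
    fn get_or_decode(&self, key: &str, raw: &[u8]) -> Arc<RepIndex> {
        let mut entries = self.entries.lock().unwrap();
        entries
            .entry(key.to_string())
            .or_insert_with(|| {
                // Stand-in for the real repetition-index decode.
                let row_offsets = raw.iter().map(|b| *b as u64).collect();
                Arc::new(RepIndex { row_offsets })
            })
            .clone()
    }
}

fn main() {
    let cache = MetadataCache::new();
    let first = cache.get_or_decode("col0/page0", &[0, 4, 9]);
    let second = cache.get_or_decode("col0/page0", &[7, 7, 7]);
    // The second lookup hits the cache: same Arc, no re-decode.
    assert!(Arc::ptr_eq(&first, &second));
    assert_eq!(second.row_offsets, vec![0, 4, 9]);
}
```

Because the cache hands out `Arc` clones, populating `cached_state` immediately is cheap, and the cached path can submit its value reads without first awaiting a rep index I/O.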