Skip to content

perf: avoid oversized variable buffers in full-zip scan batches#6013

Merged
Xuanwo merged 3 commits intomainfrom
xuanwo/laion10m-scan-full-fix
Feb 26, 2026
Merged

perf: avoid oversized variable buffers in full-zip scan batches#6013
Xuanwo merged 3 commits intomainfrom
xuanwo/laion10m-scan-full-fix

Conversation

@Xuanwo
Copy link
Copy Markdown
Collaborator

@Xuanwo Xuanwo commented Feb 25, 2026

VariableFullZipDecoder::drain used to pass the entire page-level variable data buffer to every batch while slicing only the batch offsets.
This broke the normalized variable-buffer assumption (offsets should be 0-based for the provided data window) and caused downstream variable decompressors to repeatedly size work against oversized input buffers.

Bencmarks:

  • scan-full: 34047us -> 15703us,2.168x
  • caption: 28431us -> 6979us,4.074x

Changes

  • Add LanceBuffer::slice_and_rebase_offsets in rust/lance-encoding/src/buffer.rs.
    • Slice data to the byte range referenced by the batch offsets.
    • Rebase offsets to start at zero.
    • Validate invalid/overflowing offsets.
  • Update variable full-zip batch drain path to use the new helper.

@github-actions github-actions Bot added the bug Something isn't working label Feb 25, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR Review

Summary: Performance fix for variable full-zip decoding regression. The fix correctly slices and rebases variable-width data buffers to bound memory for each batch.

Assessment

No P0/P1 issues found. The implementation is sound:

  • Proper bounds validation (empty offsets, end < base, data bounds, per-offset validation)
  • The intentional copy at line 320 (Self::copy_slice(&sliced_data)) is necessary to decouple batch memory from the full page buffer
  • Test coverage for u32/u64 paths and invalid offset rejection

Minor Observations (not blocking)

  • Consider adding a test case for offsets that exceed data length to verify bounds checking
  • The empty offset slice edge case could have explicit test coverage

LGTM 👍

@Xuanwo Xuanwo changed the title fix(encoding): avoid oversized variable buffers in full-zip scan batches perf: avoid oversized variable buffers in full-zip scan batches Feb 25, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 25, 2026

Codecov Report

❌ Patch coverage is 76.31579% with 27 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
.../lance-encoding/src/encodings/logical/primitive.rs 76.31% 27 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory the downstream decompressors should be able to realize they don't need to decompress the entire data buffer by looking at the offsets. This would avoid the offsets copy introduced here.

However, an offsets copy is relatively cheap (offsets will never be that large), and we are clearly doing the wrong thing now, so this PR is a step in the right direction.

Can we just file a follow-up to see if we can remove this copy sometime in the future?

Comment thread rust/lance-encoding/src/buffer.rs Outdated
@Xuanwo
Copy link
Copy Markdown
Collaborator Author

Xuanwo commented Feb 26, 2026

Can we just file a follow-up to see if we can remove this copy sometime in the future?

Sure, will do!

@Xuanwo Xuanwo merged commit 964ac0e into main Feb 26, 2026
28 checks passed
@Xuanwo Xuanwo deleted the xuanwo/laion10m-scan-full-fix branch February 26, 2026 16:45
@Xuanwo
Copy link
Copy Markdown
Collaborator Author

Xuanwo commented Feb 26, 2026

Filed a follow-up issue to investigate removing the per-batch offsets copy while preserving correctness and perf guarantees: #6027

wjones127 pushed a commit to wjones127/lance that referenced this pull request Mar 4, 2026
…e-format#6013)

`VariableFullZipDecoder::drain` used to pass the entire page-level
variable `data` buffer to every batch while slicing only the batch
offsets.
This broke the normalized variable-buffer assumption (offsets should be
0-based for the provided data window) and caused downstream variable
decompressors to repeatedly size work against oversized input buffers.

Bencmarks:

- scan-full: 34047us -> 15703us,2.168x
- caption: 28431us -> 6979us,4.074x

## Changes

- Add `LanceBuffer::slice_and_rebase_offsets` in
`rust/lance-encoding/src/buffer.rs`.
  - Slice data to the byte range referenced by the batch offsets.
  - Rebase offsets to start at zero.
  - Validate invalid/overflowing offsets.
- Update variable full-zip batch drain path to use the new helper.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants