fix: too large data chunk generated by highly compressed yet nested data with RLE#4431
Merged
Conversation
Signed-off-by: Xuanwo <github@xuanwo.io>
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #4431      +/-   ##
==========================================
+ Coverage   81.88%   81.90%   +0.02%
==========================================
  Files         302      302
  Lines      123146   123298     +152
  Branches   123146   123298     +152
==========================================
+ Hits       100839   100990     +151
- Misses      18502    18506       +4
+ Partials     3805     3802       -3

Flags with carried forward coverage won't be shown.
BubbleCal approved these changes on Aug 12, 2025.
Close #4429.
As described in #4429, highly compressed yet nested data encoded with RLE can produce data chunks that exceed our 16KiB threshold. This happens because our RLE encoding currently considers only the data buffer size and does not account for the size of the REP/DEF markers, which can consume up to 4 bytes per value.
Ideally, we would include the REP/DEF sizes in the calculation, but that requires significant changes. This PR implements a workaround that addresses the issue at the cost of a slightly lower compression ratio; a more comprehensive fix will follow after discussion.
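The workaround described above can be sketched as follows. This is a minimal, hypothetical illustration (the names `should_flush`, `MAX_CHUNK_BYTES`, and `LEVEL_BYTES_PER_VALUE` are not from the actual codebase): instead of counting only the compressed data buffer, the chunk-size estimate conservatively charges the worst-case 4 bytes of REP/DEF overhead per value, so highly compressed nested data flushes before the chunk blows past the threshold.

```rust
/// Hypothetical sketch of the conservative chunk-size check.
const MAX_CHUNK_BYTES: usize = 16 * 1024; // 16KiB threshold
const LEVEL_BYTES_PER_VALUE: usize = 4; // REP/DEF markers can take up to 4 bytes per value

/// Decide whether a chunk holding `num_values` values, whose RLE-compressed
/// data occupies `data_bytes`, should be flushed. Counting only `data_bytes`
/// (the old behavior) lets highly compressed nested data accumulate far more
/// values than the REP/DEF markers can fit within the threshold.
fn should_flush(data_bytes: usize, num_values: usize) -> bool {
    let estimated = data_bytes + num_values * LEVEL_BYTES_PER_VALUE;
    estimated >= MAX_CHUNK_BYTES
}

fn main() {
    // A small chunk keeps accumulating values.
    assert!(!should_flush(64, 10));
    // Highly compressed data: the buffer is tiny, but 5000 values can carry
    // up to 20000 bytes of REP/DEF markers, so the estimate triggers a flush.
    assert!(should_flush(64, 5000));
}
```

The trade-off is that chunks flush earlier than strictly necessary when the actual levels compress well, which is the "slightly lower compression ratio" mentioned above.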
This PR also includes a reproduction as a unit test to prevent regression of this bug.