refactor: allow switching to bitpack inside RLE#5595
Conversation
Code ReviewSummary: This PR adds logic to prefer bitpacking over RLE when bitpacking produces smaller output. The approach is sound and the test coverage is good. P1 Issue: Estimation formula may underestimate bitpacking sizeIn Looking at let words_per_chunk: u128 = 1;
let word_bytes: u128 = (bits / 8) as u128;
// ...
let packed_words = (1024u128 * bit_width) / (bits as u128);
total_words = total_words.saturating_add(words_per_chunk.saturating_add(packed_words));This correctly accounts for the header as 1 word (element) per chunk plus the packed data words. However, the comparison should be
This seems correct, but I'd suggest adding a brief comment in Suggestion (not blocking)Consider adding a comment in the RLE function explaining the fallback to bitpacking check: // If bitpacking would produce smaller output than RLE, skip RLE
// and let the subsequent bitpacking check handle compression.Test coverageThe new test LGTM with the optional suggestion above. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
In some cases, we initially considered using RLE but ultimately found that the data is better stored with bitpacking. This PR implements that change. | Metric | Parquet (reference) | Lance (before change) | Lance (after change) | Delta (after vs before) | |---|---:|---:|---:|---:| | `int_score` compressed size (bytes) | 56,035 | 377,838 | 71,556 | -306,282 (-81.06%) | | `int_score` vs Parquet (ratio) | 1.00x | 6.74x | 1.28x | -5.47x | | Lance chosen encoding (hint) | `RLE_DICTIONARY` (plus `RLE`, `PLAIN`, `SNAPPY`) | `rle` | `inline_bitpacking` | n/a | --- **Parts of this PR were drafted with assistance from Codex (with `gpt-5.2`) and fully reviewed and edited by me. I take full responsibility for all changes.**
In some cases, we initially considered using RLE but ultimately found that the data is better stored with bitpacking. This PR implements that change.
int_scorecompressed size (bytes)int_scorevs Parquet (ratio)RLE_DICTIONARY(plusRLE,PLAIN,SNAPPY)rleinline_bitpackingParts of this PR were drafted with assistance from Codex (with
gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.