feat: add dictionary encoding for 64bit types like int64/double #5594
Code Review

Summary
This PR adds dictionary encoding support for 64-bit types (int64/double), extending the existing 128-bit support. The compression gains look good based on the benchmarks provided.

P0/P1 Issues

P1: Potential correctness issue with float64 dictionary encoding
Using [...]
This may or may not be intentional behavior. If you want IEEE semantics where [...]

P1: Whitespace error in line 4425
There's an indentation issue at line 4425 in primitive.rs - the line has extra leading spaces that break consistent formatting:

let task = spawn_cpu(move || {
- let num_values = arrays.iter().map(|arr| arr.len() as u64).sum();
+ let num_values = arrays.iter().map(|arr| arr.len() as u64).sum();

This should be fixed to maintain consistent formatting (run [...])

Minor Observations
Codecov Report
❌ Patch coverage is [...]
…e-format#5594)

This PR will introduce dictionary encoding for 64bit types like int64/double.

| Field | Parquet (bytes) | Lance 2.1 (bytes) | Lance 2.2 before (bytes) | Lance 2.2 after (bytes) | vs Parquet | vs Lance 2.1 | vs Lance 2.2 before |
|---|---:|---:|---:|---:|---:|---:|---:|
| `token_count` | 806,050 | 350,852 | 351,208 | 312,168 | -493,882 (-61.3%) | -38,684 (-11.0%) | -39,040 (-11.1%) |
| `score` | 254,145 | 1,438,596 | 1,438,952 | 164,048 | -90,097 (-35.5%) | -1,274,548 (-88.6%) | -1,274,904 (-88.6%) |

---

**Parts of this PR were drafted with assistance from Codex (with `gpt-5.2`) and fully reviewed and edited by me. I take full responsibility for all changes.**
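
For readers unfamiliar with the technique, here is a minimal sketch of dictionary encoding for 64-bit values. It is not the code from this PR: the helper names (`dict_encode_u64`, `dict_encode_i64`, `dict_encode_f64`) are hypothetical, and keying `f64` values on their raw bit pattern is an assumption made for illustration, not something the PR description confirms.

```rust
use std::collections::HashMap;

/// Hypothetical helper: dictionary-encode 64-bit values given as raw bit patterns.
/// Returns the distinct values in first-seen order plus one index per input row.
fn dict_encode_u64(values: &[u64]) -> (Vec<u64>, Vec<u32>) {
    let mut lookup: HashMap<u64, u32> = HashMap::new();
    let mut dictionary: Vec<u64> = Vec::new();
    let mut indices: Vec<u32> = Vec::with_capacity(values.len());
    for &v in values {
        // Index the value would receive if it has not been seen before.
        let next = dictionary.len() as u32;
        let idx = *lookup.entry(v).or_insert_with(|| {
            dictionary.push(v);
            next
        });
        indices.push(idx);
    }
    (dictionary, indices)
}

/// int64 values are reinterpreted as u64 so they can share the same encoder.
fn dict_encode_i64(values: &[i64]) -> (Vec<i64>, Vec<u32>) {
    let bits: Vec<u64> = values.iter().map(|v| *v as u64).collect();
    let (dict, idx) = dict_encode_u64(&bits);
    (dict.into_iter().map(|b| b as i64).collect(), idx)
}

/// Assumption: doubles are keyed on their bit pattern, not on IEEE 754 equality.
fn dict_encode_f64(values: &[f64]) -> (Vec<f64>, Vec<u32>) {
    let bits: Vec<u64> = values.iter().map(|v| v.to_bits()).collect();
    let (dict, idx) = dict_encode_u64(&bits);
    (dict.into_iter().map(f64::from_bits).collect(), idx)
}
```

Under this sketch, `dict_encode_i64(&[7, 7, -1, 7, -1])` yields the dictionary `[7, -1]` and the indices `[0, 0, 1, 0, 1]`. Keying on bits also means `0.0` and `-0.0` get separate dictionary entries, because they differ in the sign bit even though they compare equal under IEEE 754. Whether that matches the behavior in this PR is not stated here.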