feat: abort dictionary encode if not useful#5055
Conversation
…ion we did rely on hll which sometimes are not accurate add test
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| let mut dictionary_buffer: Vec<u8> = Vec::with_capacity((max_len * cardinality) as usize); | ||
| let mut dictionary_offsets_buffer = vec![T::default()]; | ||
| let mut curr_idx = 0; | ||
| let mut indices_buffer = Vec::with_capacity(variable_width_data_block.num_values as usize); | ||
|
|
||
| for window in offsets.windows(2) { | ||
| let start = usize::try_from(window[0]).ok().unwrap(); | ||
| let end = usize::try_from(window[1]).ok().unwrap(); | ||
|
|
||
| let key = &variable_width_data_block.data[start..end]; | ||
|
|
||
| let idx: i32 = *map.entry(U8SliceKey(key)).or_insert_with(|| { | ||
| dictionary_buffer.extend_from_slice(key); | ||
| dictionary_offsets_buffer.push(T::from_usize(dictionary_buffer.len()).unwrap()); | ||
| curr_idx += 1; | ||
| curr_idx - 1 | ||
| }); | ||
|
|
||
| indices_buffer.push(idx); | ||
|
|
||
| if dictionary_buffer.len() | ||
| + indices_buffer.len() * 32 / 8 | ||
| + dictionary_offsets_buffer.len() * (bits_per_offset / 8) as usize | ||
| > variable_width_data_block.data_size() as usize | ||
| { | ||
| return None; | ||
| } | ||
| } | ||
|
|
||
| let dictionary_data_block = DataBlock::VariableWidth(VariableWidthBlock { | ||
| data: LanceBuffer::reinterpret_vec(dictionary_buffer), | ||
| offsets: LanceBuffer::reinterpret_vec(dictionary_offsets_buffer), | ||
| bits_per_offset, | ||
| num_values: curr_idx as u64, | ||
| block_info: BlockInfo::default(), | ||
| }); | ||
|
|
||
| let mut indices_data_block = DataBlock::FixedWidth(FixedWidthDataBlock { | ||
| data: LanceBuffer::reinterpret_vec(indices_buffer), | ||
| bits_per_value: DICT_INDICES_BITS_PER_VALUE, | ||
| num_values: variable_width_data_block.num_values, |
There was a problem hiding this comment.
Avoid truncating indices for 64-bit variable-width dictionaries
The new dict_encode_variable_width helper always builds indices_buffer with i32 values and tracks curr_idx as i32. The previous implementation emitted 64‑bit indices whenever the source data used 64‑bit offsets to accommodate more than 2^31 distinct values. With the current code, encoding a column that legitimately requires 64‑bit offsets but has more than ~2.1B unique entries will overflow curr_idx and produce invalid 32‑bit indices (or panic in debug), corrupting the encoded output. The index width should depend on bits_per_offset or cardinality, as it did before, instead of being hard-coded to 32 bits.
Useful? React with 👍 / 👎.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
…IfNotUseful # Conflicts: # rust/lance-encoding/src/encodings/logical/primitive.rs # rust/lance-encoding/src/encodings/logical/primitive/dict.rs
…IfNotUseful # Conflicts: # rust/lance-encoding/src/encodings/logical/primitive.rs # rust/lance-encoding/src/encodings/logical/primitive/dict.rs
…l' into abortDictionaryEncodeIfNotUseful
…IfNotUseful # Conflicts: # rust/lance-encoding/src/encodings/logical/primitive.rs
resolve lance-format#5006 Since during estimation, hll can be inaccurate, it is possible that the actual dictionary encoding does not yield much value. This PR aborts dictionary encoding on flight when it finds not meaningful. Please look at the last commit which contains the change. The first 2 commits is working on making index all 32 bits, which is open in a separate PR. --------- Co-authored-by: stevie9868 <yingjianwu2@email.com> Co-authored-by: Xuanwo <github@xuanwo.io>
resolve #5006
Since during estimation, hll can be inaccurate, it is possible that the actual dictionary encoding does not yield much value.
This PR aborts dictionary encoding on flight when it finds not meaningful.
Please look at the last commit which contains the change.
The first 2 commits is working on making index all 32 bits, which is open in a separate PR.