Skip to content

feat: abort dictionary encode if not useful#5055

Merged
Xuanwo merged 13 commits intolance-format:mainfrom
yingjianwu98:abortDictionaryEncodeIfNotUseful
Jan 28, 2026
Merged

feat: abort dictionary encode if not useful#5055
Xuanwo merged 13 commits intolance-format:mainfrom
yingjianwu98:abortDictionaryEncodeIfNotUseful

Conversation

@yingjianwu98
Copy link
Copy Markdown
Contributor

@yingjianwu98 yingjianwu98 commented Oct 23, 2025

resolve #5006

Since during estimation, hll can be inaccurate, it is possible that the actual dictionary encoding does not yield much value.
This PR aborts dictionary encoding on flight when it finds not meaningful.

Please look at the last commit which contains the change.
The first 2 commits is working on making index all 32 bits, which is open in a separate PR.

@github-actions github-actions Bot added the enhancement New feature or request label Oct 23, 2025
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +135 to +175
let mut dictionary_buffer: Vec<u8> = Vec::with_capacity((max_len * cardinality) as usize);
let mut dictionary_offsets_buffer = vec![T::default()];
let mut curr_idx = 0;
let mut indices_buffer = Vec::with_capacity(variable_width_data_block.num_values as usize);

for window in offsets.windows(2) {
let start = usize::try_from(window[0]).ok().unwrap();
let end = usize::try_from(window[1]).ok().unwrap();

let key = &variable_width_data_block.data[start..end];

let idx: i32 = *map.entry(U8SliceKey(key)).or_insert_with(|| {
dictionary_buffer.extend_from_slice(key);
dictionary_offsets_buffer.push(T::from_usize(dictionary_buffer.len()).unwrap());
curr_idx += 1;
curr_idx - 1
});

indices_buffer.push(idx);

if dictionary_buffer.len()
+ indices_buffer.len() * 32 / 8
+ dictionary_offsets_buffer.len() * (bits_per_offset / 8) as usize
> variable_width_data_block.data_size() as usize
{
return None;
}
}

let dictionary_data_block = DataBlock::VariableWidth(VariableWidthBlock {
data: LanceBuffer::reinterpret_vec(dictionary_buffer),
offsets: LanceBuffer::reinterpret_vec(dictionary_offsets_buffer),
bits_per_offset,
num_values: curr_idx as u64,
block_info: BlockInfo::default(),
});

let mut indices_data_block = DataBlock::FixedWidth(FixedWidthDataBlock {
data: LanceBuffer::reinterpret_vec(indices_buffer),
bits_per_value: DICT_INDICES_BITS_PER_VALUE,
num_values: variable_width_data_block.num_values,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid truncating indices for 64-bit variable-width dictionaries

The new dict_encode_variable_width helper always builds indices_buffer with i32 values and tracks curr_idx as i32. The previous implementation emitted 64‑bit indices whenever the source data used 64‑bit offsets to accommodate more than 2^31 distinct values. With the current code, encoding a column that legitimately requires 64‑bit offsets but has more than ~2.1B unique entries will overflow curr_idx and produce invalid 32‑bit indices (or panic in debug), corrupting the encoded output. The index width should depend on bits_per_offset or cardinality, as it did before, instead of being hard-coded to 32 bits.

Useful? React with 👍 / 👎.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Oct 23, 2025

Codecov Report

❌ Patch coverage is 90.41667% with 23 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...e-encoding/src/encodings/logical/primitive/dict.rs 90.42% 15 Missing and 3 partials ⚠️
.../lance-encoding/src/encodings/logical/primitive.rs 89.36% 1 Missing and 4 partials ⚠️

📢 Thoughts on this report? Let us know!

Xuanwo added 10 commits January 22, 2026 19:55
…IfNotUseful

# Conflicts:
#	rust/lance-encoding/src/encodings/logical/primitive.rs
#	rust/lance-encoding/src/encodings/logical/primitive/dict.rs
…IfNotUseful

# Conflicts:
#	rust/lance-encoding/src/encodings/logical/primitive.rs
#	rust/lance-encoding/src/encodings/logical/primitive/dict.rs
…IfNotUseful

# Conflicts:
#	rust/lance-encoding/src/encodings/logical/primitive.rs
@Xuanwo Xuanwo merged commit 62d878e into lance-format:main Jan 28, 2026
29 checks passed
vivek-bharathan pushed a commit to vivek-bharathan/lance that referenced this pull request Feb 2, 2026
resolve lance-format#5006

Since during estimation, hll can be inaccurate, it is possible that the
actual dictionary encoding does not yield much value.
This PR aborts dictionary encoding on flight when it finds not
meaningful.

Please look at the last commit which contains the change. 
The first 2 commits is working on making index all 32 bits, which is
open in a separate PR.

---------

Co-authored-by: stevie9868 <yingjianwu2@email.com>
Co-authored-by: Xuanwo <github@xuanwo.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Abort dictionary encoding if HLL estimation turned out to be incorrect

3 participants