refactor: use dict entries and encoded size instead of cardinality for dict decision#5891

Merged
Xuanwo merged 6 commits into main from luban/mind-into on Feb 25, 2026
Conversation

@Xuanwo Xuanwo commented Feb 5, 2026

This PR changes how we decide whether to use dictionary encoding: instead of cardinality, we now use the number of dictionary entries and the encoded size.


Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.

github-actions Bot commented Feb 5, 2026

Code Review

Summary

This PR changes the dictionary encoding decision logic from using pre-computed cardinality statistics to using a budget-based approach with max_dict_entries and max_encoded_size limits. It also adds early-out sampling to skip dictionary encoding on near-unique data.
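The budget-based approach described above can be sketched roughly as follows (a minimal illustration only; the function name and the `max_dict_entries` / `max_encoded_size` parameters are assumptions based on this summary, not the actual code in `primitive.rs`):

```rust
use std::collections::HashMap;

/// Sketch: try to build a dictionary, aborting early once either budget
/// (entry count or total encoded size of dictionary values) is exceeded.
fn try_build_dict(
    values: &[&str],
    max_dict_entries: usize,
    max_encoded_size: usize,
) -> Option<HashMap<String, u32>> {
    let mut dict: HashMap<String, u32> = HashMap::new();
    let mut encoded_size = 0usize;
    for v in values {
        if !dict.contains_key(*v) {
            let code = dict.len() as u32;
            encoded_size += v.len();
            dict.insert((*v).to_string(), code);
            // Early out: budgets exceeded, dictionary encoding won't pay off.
            if dict.len() > max_dict_entries || encoded_size > max_encoded_size {
                return None;
            }
        }
    }
    Some(dict)
}
```

The point of the early return is that high-cardinality data fails fast, rather than paying to hash every value before discovering the dictionary is too large.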

P0 Issues

  1. Hash collision risk in sample_is_near_unique for variable-width data (primitive.rs:287-318)

    The function uses xxh3_64 hashes in a HashSet<u64> to estimate the uniqueness of strings. Hash collisions undercount unique values, which could cause the function to misclassify high-cardinality data as "not near-unique", wasting CPU on a dictionary encoding attempt that will bail out. In practice, though, xxh3 has excellent distribution, and for a 4096-sample set the birthday-bound collision probability is on the order of 2^-41, i.e. negligible.

    Recommendation: This is acceptable for a sampling heuristic since false negatives (missing some unique data) just mean we try dictionary encoding and bail out. However, document this trade-off explicitly.

  2. Magic constants without documentation (primitive.rs:66-67)

    DEFAULT_SAMPLE_SIZE: usize = 4096 and DEFAULT_SAMPLE_UNIQUE_RATIO: f64 = 0.98 are introduced inside should_dictionary_encode. These affect encoding behavior but aren't documented or exposed as configurable.

    Recommendation: Move these to module-level constants with documentation explaining their purpose and tuning considerations.
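The sampling heuristic under discussion can be illustrated roughly like this (a sketch only: the real code uses `xxh3_64`, here the standard library's `DefaultHasher` stands in so the example is self-contained, and the thresholds mirror the `DEFAULT_SAMPLE_SIZE` / `DEFAULT_SAMPLE_UNIQUE_RATIO` constants above):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Sketch: systematically sample up to `sample_size` values and report
/// whether the sampled data looks near-unique. A hash collision can only
/// undercount distinct values, i.e. err toward "not near-unique".
fn sample_is_near_unique(values: &[&str], sample_size: usize, unique_ratio: f64) -> bool {
    let step = (values.len() / sample_size.max(1)).max(1);
    let mut hashes: HashSet<u64> = HashSet::new();
    let mut sampled = 0usize;
    for v in values.iter().step_by(step).take(sample_size) {
        let mut h = DefaultHasher::new();
        v.hash(&mut h);
        hashes.insert(h.finish());
        sampled += 1;
    }
    sampled > 0 && (hashes.len() as f64) >= unique_ratio * (sampled as f64)
}
```

A `true` result means dictionary encoding is skipped entirely; a false negative merely means the budget-based attempt runs and bails out.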

P1 Issues

  1. Lazy cardinality computation may cause unexpected latency (statistics.rs:189-214)

    The change makes VariableWidthBlock::get_stat(Stat::Cardinality) lazily compute cardinality, acquiring a write lock inside what callers may expect to be a read-only operation. If other code paths depend on cardinality being pre-computed (e.g., for progress estimation or logging), they may experience unexpected blocking.

    Recommendation: Verify no other code paths rely on cardinality being eagerly computed. The test coverage seems adequate, but consider a brief audit.

  2. Potential division edge case (primitive.rs:181-183)

    ```rust
    let threshold_cardinality = num_values
        .checked_div(divisor.max(1))
        .unwrap_or(0)
        .min(max_cardinality);
    ```

    Using divisor.max(1) is good, but if someone sets LANCE_ENCODING_DICT_DIVISOR=0 via an env var, it is silently treated as 1 rather than raising an error. This is likely fine but may surprise users who expect an error.
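The clamping behaviour can be seen in isolation (a minimal sketch of the expression quoted above; `threshold_cardinality` as a standalone function is hypothetical):

```rust
/// Sketch of the quoted expression: a divisor of 0 (e.g. from a misconfigured
/// LANCE_ENCODING_DICT_DIVISOR env var) is silently clamped to 1 rather than
/// producing an error, so the threshold quietly becomes num_values itself
/// (capped by max_cardinality).
fn threshold_cardinality(num_values: u64, divisor: u64, max_cardinality: u64) -> u64 {
    num_values
        .checked_div(divisor.max(1))
        .unwrap_or(0)
        .min(max_cardinality)
}
```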

Positive Notes

  • Good use of budget-based approach that aborts dictionary encoding early if limits are exceeded
  • The sampling heuristic should reduce CPU waste on high-cardinality data
  • Tests cover the new budget limits well

Testing Suggestions

Consider adding a test that verifies dictionary encoding still works correctly when the sample suggests near-uniqueness but actual data has lower cardinality (edge case where sampling step misses repeated patterns).

@westonpace (Member)
> Instead of cardinality, we will use dict entries and encoded size instead.

Does this mean we are always calculating the true cardinality (until we hit the limit) instead of HLL? Do we know what impact this has on write speeds?

Xuanwo commented Feb 24, 2026

> Does this mean we are always calculating the true cardinality (until we hit the limit) instead of HLL? Do we know what impact this has on write speeds?

Yes.

I ran some local benchmarks on a simulated dataset:

  • str_unique: 17.579ms -> 13.378ms (-23.90%)
  • str_low_card_128: 15.786ms -> 11.952ms (-24.29%)
  • u64_unique: 2.744ms -> 2.698ms (-1.68%)
  • u64_low_card_128: 2.554ms -> 2.560ms (+0.23%)

I also ran some tests on real data, such as Pride and Prejudice:

  • pride_lines_r8192: 2.9747ms -> 3.2782ms (+10.20%)

The unique ratio is 0.9796, just under the 0.98 threshold, so the sampling check does not skip dictionary encoding even though the data is a poor fit for it; we end up doing the extra work, which makes it slower.

I think it's a good trade-off in general, but maybe you'll want to add config options to let users tune those settings?

@westonpace westonpace left a comment
This works for me, thanks for helping me understand. There is a clippy failure I think btw.

Comment thread rust/lance-encoding/src/encodings/logical/primitive/dict.rs
```rust
})
}

fn sample_is_near_unique(
```
Member

Let's document this. It's quite clever. Looks like you are randomly sampling 4096 values and testing them for uniqueness before you consider dictionary encoding further?

This is basically just a different way of doing cardinality estimation right?

Collaborator Author

Yes, and I pick values via systematic sampling (so it only approximates random sampling).
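Systematic sampling here means picking values at a fixed stride rather than drawing truly random indices; a minimal sketch (the function name is illustrative, not the PR's actual helper):

```rust
/// Sketch: visit every `step`-th element so roughly `sample_size` values are
/// examined regardless of the input length. Deterministic, cache-friendly,
/// and close enough to random for data without strong periodic patterns.
fn systematic_sample<T>(values: &[T], sample_size: usize) -> Vec<&T> {
    let step = (values.len() / sample_size.max(1)).max(1);
    values.iter().step_by(step).take(sample_size).collect()
}
```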

Comment on lines +199 to +213
```diff
 if stat != Stat::Cardinality {
     return None;
 }

+let computed = self.compute_cardinality();
 let mut block_info = self.block_info.0.write().unwrap();
-if block_info.is_empty() {
-    panic!("get_stat should be called after statistics are computed.");
-}
-block_info.get(&stat).cloned()
+Some(
+    block_info
+        .entry(stat)
+        .or_insert_with(|| computed.clone())
+        .clone(),
+)
```
Member

It looks like you stopped eagerly calculating HLL but you still calculate it on demand? Why is that? I think we should probably do this with all the stats but I'm curious why you didn't just get rid of HLL? Is there some path that still needs it?

Collaborator Author

I wanted to minimize the code changes in this PR when I started this work. It's true that we could simply remove it; I will do that in a follow-up PR.

@github-actions

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@Xuanwo Xuanwo merged commit 861727c into main Feb 25, 2026
27 of 28 checks passed
@Xuanwo Xuanwo deleted the luban/mind-into branch February 25, 2026 18:47
wjones127 pushed a commit to wjones127/lance that referenced this pull request Feb 25, 2026
…r dict decision (lance-format#5891)

This PR changed how we decide to use dict or not. Instead of
cardinality, we will use dict entries and encoded size instead.

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.2`) and fully reviewed and edited by me. I take full
responsibility for all changes.**
wjones127 pushed a commit that referenced this pull request Feb 26, 2026
…r dict decision (#5891)