refactor: use dict entries and encoded size instead of cardinality for dict decision#5891

Merged
Xuanwo merged 6 commits into main from luban/mind-into on Feb 25, 2026
Conversation

@Xuanwo Xuanwo commented Feb 5, 2026

This PR changes how we decide whether to use dictionary encoding: instead of cardinality, we now use the number of dictionary entries and the encoded size.


Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.

github-actions Bot commented Feb 5, 2026

Code Review

Summary

This PR changes the dictionary encoding decision logic from using pre-computed cardinality statistics to using a budget-based approach with max_dict_entries and max_encoded_size limits. It also adds early-out sampling to skip dictionary encoding on near-unique data.
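The budget-based approach described above can be sketched roughly as follows (a minimal illustration only; the function name and the `max_dict_entries` / `max_encoded_size` parameters are assumptions based on this summary, not the actual code in `primitive.rs`):

```rust
use std::collections::HashMap;

/// Sketch: try to build a dictionary, aborting early once either budget
/// (entry count or total encoded size of dictionary values) is exceeded.
fn try_build_dict(
    values: &[&str],
    max_dict_entries: usize,
    max_encoded_size: usize,
) -> Option<HashMap<String, u32>> {
    let mut dict: HashMap<String, u32> = HashMap::new();
    let mut encoded_size = 0usize;
    for v in values {
        if !dict.contains_key(*v) {
            let code = dict.len() as u32;
            encoded_size += v.len();
            dict.insert((*v).to_string(), code);
            // Early out: budgets exceeded, dictionary encoding won't pay off.
            if dict.len() > max_dict_entries || encoded_size > max_encoded_size {
                return None;
            }
        }
    }
    Some(dict)
}
```

The point of the early return is that high-cardinality data fails fast, rather than paying to hash every value before discovering the dictionary is too large.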

P0 Issues

  1. Hash collision risk in sample_is_near_unique for variable-width data (primitive.rs:287-318)

    The function uses xxh3_64 hashes in a HashSet<u64> to estimate the uniqueness of strings. Hash collisions undercount unique values, which could cause the function to misclassify high-cardinality data as "not near-unique", wasting CPU on a dictionary encoding attempt that will bail out. In practice, though, xxh3 has excellent distribution, and for a 4096-sample set the birthday-bound collision probability is on the order of 2^-41, i.e. negligible.

    Recommendation: This is acceptable for a sampling heuristic since false negatives (missing some unique data) just mean we try dictionary encoding and bail out. However, document this trade-off explicitly.

  2. Magic constants without documentation (primitive.rs:66-67)

    DEFAULT_SAMPLE_SIZE: usize = 4096 and DEFAULT_SAMPLE_UNIQUE_RATIO: f64 = 0.98 are introduced inside should_dictionary_encode. These affect encoding behavior but aren't documented or exposed as configurable.

    Recommendation: Move these to module-level constants with documentation explaining their purpose and tuning considerations.
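The sampling heuristic under discussion can be illustrated roughly like this (a sketch only: the real code uses `xxh3_64`, here the standard library's `DefaultHasher` stands in so the example is self-contained, and the thresholds mirror the `DEFAULT_SAMPLE_SIZE` / `DEFAULT_SAMPLE_UNIQUE_RATIO` constants above):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Sketch: systematically sample up to `sample_size` values and report
/// whether the sampled data looks near-unique. A hash collision can only
/// undercount distinct values, i.e. err toward "not near-unique".
fn sample_is_near_unique(values: &[&str], sample_size: usize, unique_ratio: f64) -> bool {
    let step = (values.len() / sample_size.max(1)).max(1);
    let mut hashes: HashSet<u64> = HashSet::new();
    let mut sampled = 0usize;
    for v in values.iter().step_by(step).take(sample_size) {
        let mut h = DefaultHasher::new();
        v.hash(&mut h);
        hashes.insert(h.finish());
        sampled += 1;
    }
    sampled > 0 && (hashes.len() as f64) >= unique_ratio * (sampled as f64)
}
```

A `true` result means dictionary encoding is skipped entirely; a false negative merely means the budget-based attempt runs and bails out.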

P1 Issues

  1. Lazy cardinality computation may cause unexpected latency (statistics.rs:189-214)

    The change makes VariableWidthBlock::get_stat(Stat::Cardinality) lazily compute cardinality, acquiring a write lock inside what callers may expect to be a read-only operation. If other code paths depend on cardinality being pre-computed (e.g., for progress estimation or logging), they may experience unexpected blocking.

    Recommendation: Verify no other code paths rely on cardinality being eagerly computed. The test coverage seems adequate, but consider a brief audit.

  2. Potential division edge case (primitive.rs:181-183)

    ```rust
    let threshold_cardinality = num_values
        .checked_div(divisor.max(1))
        .unwrap_or(0)
        .min(max_cardinality);
    ```

    Using divisor.max(1) is good, but if someone sets LANCE_ENCODING_DICT_DIVISOR=0 via an env var, it is silently treated as 1 rather than raising an error. This is likely fine but may surprise users who expect an error.
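The clamping behaviour can be seen in isolation (a minimal sketch of the expression quoted above; `threshold_cardinality` as a standalone function is hypothetical):

```rust
/// Sketch of the quoted expression: a divisor of 0 (e.g. from a misconfigured
/// LANCE_ENCODING_DICT_DIVISOR env var) is silently clamped to 1 rather than
/// producing an error, so the threshold quietly becomes num_values itself
/// (capped by max_cardinality).
fn threshold_cardinality(num_values: u64, divisor: u64, max_cardinality: u64) -> u64 {
    num_values
        .checked_div(divisor.max(1))
        .unwrap_or(0)
        .min(max_cardinality)
}
```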

Positive Notes

  • Good use of budget-based approach that aborts dictionary encoding early if limits are exceeded
  • The sampling heuristic should reduce CPU waste on high-cardinality data
  • Tests cover the new budget limits well

Testing Suggestions

Consider adding a test that verifies dictionary encoding still works correctly when the sample suggests near-uniqueness but actual data has lower cardinality (edge case where sampling step misses repeated patterns).

@westonpace (Member)
> Instead of cardinality, we will use dict entries and encoded size instead.

Does this mean we are always calculating the true cardinality (until we hit the limit) instead of HLL? Do we know what impact this has on write speeds?

Xuanwo commented Feb 24, 2026

> Does this mean we are always calculating the true cardinality (until we hit the limit) instead of HLL? Do we know what impact this has on write speeds?

Yes.

I ran some local benchmarks on a simulated dataset:

  • str_unique: 17.579ms -> 13.378ms (-23.90%)
  • str_low_card_128: 15.786ms -> 11.952ms (-24.29%)
  • u64_unique: 2.744ms -> 2.698ms (-1.68%)
  • u64_low_card_128: 2.554ms -> 2.560ms (+0.23%)

I also ran some tests on real data, such as Pride and Prejudice:

  • pride_lines_r8192: 2.9747ms -> 3.2782ms (+10.20%)

The unique ratio is 0.9796, just under the 0.98 threshold, so the sampling check does not skip dictionary encoding even though the data is a poor fit for it; we end up doing the extra work, which makes it slower.

I think it's a good trade-off in general, but maybe you'll want to add config options to let users tune those settings?

@westonpace westonpace left a comment
This works for me, thanks for helping me understand. There is a clippy failure I think btw.

Comment thread rust/lance-encoding/src/encodings/logical/primitive/dict.rs
```rust
})
}

fn sample_is_near_unique(
```
Member

Let's document this. It's quite clever. Looks like you are randomly sampling 4096 values and testing them for uniqueness before you consider dictionary encoding further?

This is basically just a different way of doing cardinality estimation right?

Collaborator Author

Yes, and I pick values via systematic sampling (so it only approximates random sampling).
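Systematic sampling here means picking values at a fixed stride rather than drawing truly random indices; a minimal sketch (the function name is illustrative, not the PR's actual helper):

```rust
/// Sketch: visit every `step`-th element so roughly `sample_size` values are
/// examined regardless of the input length. Deterministic, cache-friendly,
/// and close enough to random for data without strong periodic patterns.
fn systematic_sample<T>(values: &[T], sample_size: usize) -> Vec<&T> {
    let step = (values.len() / sample_size.max(1)).max(1);
    values.iter().step_by(step).take(sample_size).collect()
}
```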

Comment on lines +199 to +213
```diff
 if stat != Stat::Cardinality {
     return None;
 }

+let computed = self.compute_cardinality();
 let mut block_info = self.block_info.0.write().unwrap();
-if block_info.is_empty() {
-    panic!("get_stat should be called after statistics are computed.");
-}
-block_info.get(&stat).cloned()
+Some(
+    block_info
+        .entry(stat)
+        .or_insert_with(|| computed.clone())
+        .clone(),
+)
```
Member

It looks like you stopped eagerly calculating HLL but you still calculate it on demand? Why is that? I think we should probably do this with all the stats but I'm curious why you didn't just get rid of HLL? Is there some path that still needs it?

Collaborator Author

I wanted to minimize the code changes in this PR when I started this work. It's true that we could simply remove it; I will do that in a follow-up PR.

@github-actions

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@Xuanwo Xuanwo merged commit 861727c into main Feb 25, 2026
27 of 28 checks passed
@Xuanwo Xuanwo deleted the luban/mind-into branch February 25, 2026 18:47
wjones127 pushed a commit to wjones127/lance that referenced this pull request Feb 25, 2026
…r dict decision (lance-format#5891)

This PR changed how we decide to use dict or not. Instead of
cardinality, we will use dict entries and encoded size instead.

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.2`) and fully reviewed and edited by me. I take full
responsibility for all changes.**
wjones127 pushed a commit that referenced this pull request Feb 26, 2026
…r dict decision (#5891)