
perf: calculate cardinality lazily#5783

Merged
Xuanwo merged 2 commits into main from luban/fix-bitmap-btree-build-t
Jan 27, 2026

Conversation

@Xuanwo
Collaborator

@Xuanwo Xuanwo commented Jan 22, 2026

This PR will fix #5714.

Cardinality calculation is slow, but we only need it when building the dictionary. This PR changes the calculation to a lazy approach.
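The lazy-and-cache idea can be sketched roughly as below. This is an illustrative standalone example, not the actual lance code: the real implementation caches stats in a `block_info` map behind an `RwLock`, while here a `OnceLock` and an exact distinct count stand in for the cached HLL estimate.

```rust
use std::sync::OnceLock;

// Hypothetical sketch: the expensive cardinality computation is deferred
// until a caller actually asks for it, then cached for later calls.
struct Block {
    values: Vec<u64>,
    cardinality: OnceLock<u64>,
}

impl Block {
    fn cardinality(&self) -> u64 {
        *self.cardinality.get_or_init(|| {
            // stand-in for the expensive estimate: exact distinct count
            let mut v = self.values.clone();
            v.sort_unstable();
            v.dedup();
            v.len() as u64
        })
    }
}

fn main() {
    let b = Block {
        values: vec![1, 2, 2, 3],
        cardinality: OnceLock::new(),
    };
    assert_eq!(b.cardinality(), 3);
    assert_eq!(b.cardinality(), 3); // second call hits the cache
    println!("ok");
}
```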

```shell
BITMAP 5M low-float: 229.725 (base) → 264.306 (bad) → 215.792 (good)
BTREE  5M low-float: 216.628 (base) → 340.683 (bad) → 222.974 (good)
```

Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.

@github-actions
Contributor

Code Review

Summary: The PR correctly implements lazy cardinality calculation to fix the performance regression in BITMAP/BTREE index building. The approach is sound.

P1: Minor inefficiency in get_stat implementation

In FixedWidthDataBlock::get_stat(), the computed cardinality is cloned one more time than necessary:

```rust
let computed = self.cardinality();
let mut block_info = self.block_info.0.write().unwrap();
Some(
    block_info
        .entry(stat)
        .or_insert_with(|| computed.clone())  // clone here
        .clone(),  // and clone again here
)
```

Since computed is already materialized before the insertion, it can be moved into the map directly; the .clone() inside or_insert_with only exists because the closure captures computed by reference. Consider:

```rust
let computed = self.cardinality();
let mut block_info = self.block_info.0.write().unwrap();
Some(block_info.entry(stat).or_insert(computed).clone())
```

This avoids one clone when the stat wasn't already present. The entry API's or_insert will only insert computed if the key is vacant, so it's equivalent but slightly more efficient.
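The entry-API behavior the suggestion relies on can be checked with a plain HashMap (a standalone illustration, not the lance code; the key and value types are simplified):

```rust
use std::collections::HashMap;

fn main() {
    let mut stats: HashMap<&str, i32> = HashMap::new();
    stats.insert("cardinality", 42);

    // or_insert leaves an existing entry untouched and returns a
    // reference to it; the provided value is simply dropped
    let v = *stats.entry("cardinality").or_insert(7);
    assert_eq!(v, 42);

    // for a vacant key the provided value is moved in as-is,
    // with no extra clone needed
    let w = *stats.entry("min").or_insert(7);
    assert_eq!(w, 7);
    println!("ok");
}
```

The one caveat is that or_insert evaluates its argument eagerly even when the key is present, which is why or_insert_with exists; here the value is already computed, so or_insert is the cheaper choice.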

Note on race condition (non-issue)

The pattern of releasing the read lock before acquiring the write lock could theoretically allow another thread to compute the same stat concurrently. However, this is benign:

  1. The entry().or_insert_with() pattern ensures only one computation is stored
  2. Duplicate computations would produce identical results
  3. The performance win from avoiding unnecessary computation on the happy path outweighs rare double-computation
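The read-then-write pattern described above can be sketched as follows. The names (Stats, get_or_compute) and the plain u64 stat type are hypothetical, not the actual lance-encoding API:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

struct Stats {
    cache: RwLock<HashMap<String, u64>>,
}

impl Stats {
    fn get_or_compute(&self, key: &str, compute: impl Fn() -> u64) -> u64 {
        // fast path: read lock only, released at the end of the if-let
        if let Some(v) = self.cache.read().unwrap().get(key) {
            return *v;
        }
        // slow path: compute with no lock held, then insert under the
        // write lock; a concurrent thread may also compute, but entry()
        // ensures only one value is stored and both results are identical
        let computed = compute();
        let mut guard = self.cache.write().unwrap();
        *guard.entry(key.to_string()).or_insert(computed)
    }
}

fn main() {
    let s = Stats { cache: RwLock::new(HashMap::new()) };
    assert_eq!(s.get_or_compute("card", || 5), 5);
    assert_eq!(s.get_or_compute("card", || 99), 5); // cached value wins
    println!("ok");
}
```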

Tests: The two new tests adequately verify the lazy behavior.

Verdict: LGTM with the minor optimization suggestion above.

@codecov

codecov Bot commented Jan 22, 2026

Codecov Report

❌ Patch coverage is 88.88889% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-encoding/src/statistics.rs | 88.88% | 4 Missing ⚠️ |


Contributor

@wjones127 wjones127 left a comment


Nice work. If the cardinality check is a threshold, I wonder if we should early return. For example, if we have a rule that says "dictionary encode if cardinality > 100", then maybe we can just stop computing cardinality as soon as our estimate is above 100.
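The early-return idea can be sketched with an exact counter (the function name distinct_exceeds is hypothetical, and the real code would apply the same cutoff to a running HLL estimate rather than a HashSet):

```rust
use std::collections::HashSet;

// Hypothetical sketch: stop counting distinct values as soon as the
// running count exceeds the threshold that decides against dictionary
// encoding; an exact answer is never needed past that point.
fn distinct_exceeds(values: &[u64], threshold: usize) -> bool {
    let mut seen = HashSet::new();
    for v in values {
        seen.insert(*v);
        if seen.len() > threshold {
            return true; // early exit: threshold already crossed
        }
    }
    false
}

fn main() {
    // 1000 values but only 50 distinct ones
    let data: Vec<u64> = (0..1000).map(|i| i % 50).collect();
    assert!(!distinct_exceeds(&data, 100));
    assert!(distinct_exceeds(&data, 10)); // stops after ~11 distinct values
    println!("ok");
}
```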

@wjones127 wjones127 self-assigned this Jan 23, 2026
@Xuanwo
Collaborator Author

Xuanwo commented Jan 26, 2026

@wjones127, Thanks! Makes sense. In our case the dictionary decision is based on estimated size ratio, so the “cardinality threshold” isn’t a fixed constant, but we could derive an upper bound for cardinality from the ratio and early-exit the HLL once the estimate exceeds that bound.

Would you be OK if I keep this PR focused on fixing the regression (lazy cardinality), and do the early-exit logic as a follow-up PR with benchmarks to validate the impact?

@wjones127
Contributor

> Would you be OK if I keep this PR focused on fixing the regression (lazy cardinality), and do the early-exit logic as a follow-up PR with benchmarks to validate the impact?

Yeah, definitely do that as a follow up if you want to do it. I think this PR is already valuable as is!

@Xuanwo Xuanwo merged commit 3044051 into main Jan 27, 2026
28 checks passed
@Xuanwo Xuanwo deleted the luban/fix-bitmap-btree-build-t branch January 27, 2026 12:11
vivek-bharathan pushed a commit to vivek-bharathan/lance that referenced this pull request Feb 2, 2026
@tomtomwombat

You may want to consider using https://crates.io/crates/hyperloglockless in place of the HyperLogLogPlus. It has O(1) cardinality queries, very fast inserts, and better accuracy; I believe it can be a drop-in replacement.
Also, because you are using a low precision of 4, the HLL++ variant is unnecessary: at that precision it switches to the "dense" HLL representation after very few items, so the time spent on HLL++'s sparse → dense conversion is wasted.

Disclaimer: I am the author of hyperloglockless.
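For context on the precision point: an HLL with precision p uses 2^p registers, so p = 4 means just 16 registers, and the relative standard error of the estimate is roughly 1.04/√m. A quick check of those numbers:

```rust
fn main() {
    let p: u32 = 4;            // low precision mentioned in the comment
    let m = 1u32 << p;         // number of HLL registers = 2^p
    assert_eq!(m, 16);

    // standard HLL error bound: ~1.04 / sqrt(m) ≈ 26% at p = 4
    let err = 1.04 / (m as f64).sqrt();
    assert!((err - 0.26).abs() < 0.01);
    println!("registers = {m}, stderr ≈ {err:.2}");
}
```

With only 16 registers to fill, a sparse representation saves almost nothing before the dense switch, which is the commenter's point.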



Successfully merging this pull request may close these issues.

Build time regression for BITMAP and BTREE index

3 participants