refactor: use dict entries and encoded size instead of cardinality for dict decision (#5891)
Conversation
Code Review

Summary

This PR changes the dictionary encoding decision logic from using pre-computed cardinality statistics to using a budget-based approach.

P0 Issues

P1 Issues

Positive Notes

Testing Suggestions

Consider adding a test that verifies dictionary encoding still works correctly when the sample suggests near-uniqueness but the actual data has lower cardinality (an edge case where the sampling step misses repeated patterns).
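The budget-based decision the summary describes can be sketched roughly as follows. This is a hypothetical illustration, not Lance's actual API: the function name, parameters, and thresholds are all assumptions.

```rust
// Hypothetical sketch of a budget-based dictionary-encoding decision.
// All names and thresholds here are illustrative, not Lance's real code.
fn should_dict_encode(
    dict_entries: usize,     // distinct values seen while building the dictionary
    max_dict_entries: usize, // budget: give up past this many entries
    dict_encoded_size: u64,  // estimated size with dictionary encoding
    plain_size: u64,         // estimated size without it
) -> bool {
    // Reject if the dictionary blew the entry budget,
    // and only encode when it actually saves space.
    dict_entries <= max_dict_entries && dict_encoded_size < plain_size
}

fn main() {
    // Low-cardinality column that compresses well: encode.
    assert!(should_dict_encode(100, 4096, 1_000, 10_000));
    // Near-unique column: the dictionary is bigger than the plain data.
    assert!(!should_dict_encode(5_000, 4096, 12_000, 10_000));
    println!("ok");
}
```

The point of the budget is that building the dictionary can stop early: once the entry count exceeds the budget, the column is treated as not worth dictionary-encoding without ever computing exact cardinality.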
Does this mean we are always calculating the true cardinality (until we hit the limit) instead of HLL? Do we know what impact this has on write speeds?
Yes. I ran some local benchmarks on a simulated dataset:

I also ran some tests on real data. The unique ratio is

I think in general it's good, but maybe you'll want to add config options to let users tune those settings?
westonpace
left a comment
This works for me, thanks for helping me understand. There is a clippy failure I think btw.
```diff
     })
 }

+fn sample_is_near_unique(
```
Let's document this. It's quite clever. Looks like you are randomly sampling 4096 values and testing them for uniqueness before you consider dictionary encoding further?
This is basically just a different way of doing cardinality estimation right?
Yes, and I pick the values via systematic sampling (so it is only somewhat random).
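A rough sketch of that probe, assuming a fixed-stride (systematic) sample of up to 4096 values. The function name matches the diff, but the stride logic, the 4096 default, and the 90% "near unique" threshold are guesses for illustration:

```rust
use std::collections::HashSet;
use std::hash::Hash;

// Illustrative sketch of a systematic-sampling near-uniqueness check.
// Sample every k-th value so the whole array is covered, then see how
// many of the sampled values are distinct. Thresholds are assumptions.
fn sample_is_near_unique<T: Eq + Hash>(values: &[T], sample_size: usize) -> bool {
    if values.is_empty() {
        return false;
    }
    // Systematic sampling: fixed stride across the array (at least 1).
    let stride = (values.len() / sample_size).max(1);
    let sampled: Vec<&T> = values.iter().step_by(stride).take(sample_size).collect();
    let distinct: HashSet<&T> = sampled.iter().copied().collect();
    // "Near unique": more than 90% of sampled values are distinct.
    distinct.len() * 10 > sampled.len() * 9
}

fn main() {
    let unique: Vec<u32> = (0..10_000).collect();
    assert!(sample_is_near_unique(&unique, 4096));

    let repetitive: Vec<u32> = (0..10_000).map(|i| i % 8).collect();
    assert!(!sample_is_near_unique(&repetitive, 4096));
    println!("ok");
}
```

As the review thread notes, this is effectively a cheap cardinality estimate: if even a small systematic sample is almost all distinct, dictionary encoding is unlikely to pay off and can be skipped early.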
```diff
 if stat != Stat::Cardinality {
     return None;
 }

+let computed = self.compute_cardinality();
 let mut block_info = self.block_info.0.write().unwrap();
-if block_info.is_empty() {
-    panic!("get_stat should be called after statistics are computed.");
-}
-block_info.get(&stat).cloned()
+Some(
+    block_info
+        .entry(stat)
+        .or_insert_with(|| computed.clone())
+        .clone(),
+)
```
It looks like you stopped eagerly calculating HLL but you still calculate it on demand? Why is that? I think we should probably do this with all the stats but I'm curious why you didn't just get rid of HLL? Is there some path that still needs it?
I wanted to minimize the code changes in this PR when I started this work. It is true that we could simply remove it; I will do that in a follow-up PR.
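The compute-on-demand caching in the diff above can be isolated into a minimal sketch. Here `Stats`, the string key, and `u64` values are stand-ins for the real `Stat` enum and statistic types; only the `entry(..).or_insert_with(..)` memoization pattern is taken from the diff:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Minimal sketch of lazy, memoized statistic computation behind an RwLock.
// The key/value types are placeholders, not Lance's actual types.
struct Stats {
    cache: RwLock<HashMap<&'static str, u64>>,
}

impl Stats {
    fn get_stat(&self, stat: &'static str, compute: impl FnOnce() -> u64) -> u64 {
        // As in the diff, the value is computed before taking the write lock,
        // so a concurrent caller may compute redundantly, but only the first
        // inserted value is kept and no lock is held during computation.
        let computed = compute();
        let mut cache = self.cache.write().unwrap();
        *cache.entry(stat).or_insert_with(|| computed)
    }
}

fn main() {
    let stats = Stats { cache: RwLock::new(HashMap::new()) };
    assert_eq!(stats.get_stat("cardinality", || 42), 42);
    // Second call hits the cache; the new closure's result is discarded.
    assert_eq!(stats.get_stat("cardinality", || 7), 42);
    println!("ok");
}
```

This mirrors the reviewer's observation: the stat is no longer eagerly computed at write time, but the first `get_stat` call still pays the full computation cost, and later calls return the cached value.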
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
…r dict decision (lance-format#5891) This PR changes how we decide whether to use dictionary encoding. Instead of cardinality, we now use the number of dict entries and the encoded size. --- **Parts of this PR were drafted with assistance from Codex (with `gpt-5.2`) and fully reviewed and edited by me. I take full responsibility for all changes.**