-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Implement bitmap_distinct function using roaring bitmap #1841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e-dard
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty cool @Ted-Jiang! Are you planning on using this functionality in a project using Datafusion?
Some thoughts:
- Nice speedup on these distinct counts for integers.
- I wonder if/how this gets things closer to being able to do distinct on compressed data (in DF's case on dictionary encoded columns). The problem (as I understand it) is that there is no guarantee that Arrow dictionaries have the same encoded representation for a value across batches, or even in the same record batch (if I remember how dictionary concatenation currently works in Arrow).
- Would this work on 64-bit columns if they could first be casted to 32-bit? That is, assuming the contents of the 64-bit column actually fit as 32-bit unsigned integers?
| impl Accumulator for BitmapDistinctCountAccumulator { | ||
| //state() can be used by physical nodes to aggregate states together and send them over the network/threads, to combine values. | ||
| fn state(&self) -> Result<Vec<ScalarValue>> { | ||
| //maybe run optimized |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a bit of a tricky question because the bitmap only lives for the duration of the query right? One of the main things that can be optimised is to convert array containers into run containers (which use RLE compression), so it seems likely that it depends on the contents of the input. A benchmark with typical data might help answer this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, when meeting large amount data, run optimized may extremely reduce shuffle data between thread or process(ballista), it may reduce IO cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check the map size then decide whether run optimized
| Ok(()) | ||
| } | ||
|
|
||
| fn evaluate(&self) -> Result<ScalarValue> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this generally only gets called once per query? Only asking because cardinality is quite an expensive call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry , is there something i miss ? Are there more efficient API, to get the number of integers contained in the bitmap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this is the only way. I was just checking it is only called once for the final result.
IMO, it will lose front 32 bit info, the result will be incorrect. |
|
@alamb i found in CI I think we should already depend on clang, |
78f8b36 to
8b23db1
Compare
datafusion/Cargo.toml
Outdated
| pyo3 = { version = "0.15", optional = true } | ||
| tempfile = "3" | ||
| parking_lot = "0.12" | ||
| croaring = "0.5.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am somewhat worried about adding this dependency to datafusion (at least as a required dependency). Not only will it increase the requirements / compile time for datafusion, it also complicates the build process (as now one needs to install clang to build datafusion)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb Thanks for your reply. I agree It 's complicated for add a new dependency clang into datafusion.
IMO, we still need an high-efficiency bitmap implement roarling-bitmap, like IOx and other high-end OLAP systems to do some optimize like late materialization. Maybe we should change croaring to roarling-rs (completely written in rust, won't cost too mush compile time, i also find datafusion compile\build parallel ), maybe one day it will catch up with croaring 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @alamb , would be good if we can either make this dependency optional or change to a pure rust implementation.
.github/workflows/rust.yml
Outdated
| rustup toolchain install ${{ matrix.rust }} | ||
| rustup default ${{ matrix.rust }} | ||
| rustup component add rustfmt | ||
| - name: Set up Clang |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in general apache projects try not to use "third-party" github actions (aka actions not hosted by github or apache itself) so I am worried about using this action
27c468a to
dcb8601
Compare
houqp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| } | ||
| } | ||
|
|
||
| pub(crate) fn is_bitmap_count_distinct_supported_arg_type(arg_type: &DataType) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to support U64, I64 or other numeric type?
@Ted-Jiang
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liukun4515 AFAIK, roarling-rs only support insert U32, I32. If we cast u64 to u32 may cause incorrect result.
liukun4515
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
ab9907d to
ea33bc8
Compare
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this @Ted-Jiang
I have a few concerns:
- The new dependency on
roaring - The handling of null values
- This may make the user experience less great;
What I mean by the user experience less great is that if we added bitmap_distinct users will have to decide if they should call distinct or bitmap_distinct. This choice which seems non ideal because they now have to understand the tradeoffs (like limited type support)
Would it be possible for DataFusion to automatically chose to use the roaring bitmap implementation as part of the implementation of the normal distinct aggregate? In other words, automatically use roaring for the supported datatypes and fall back to a slower but more general implementation for unsupported datatypes?
datafusion-physical-expr/Cargo.toml
Outdated
| paste = "^1.0" | ||
| ahash = { version = "0.7", default-features = false } | ||
| ordered-float = "2.10" | ||
| roaring = "0.8.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you think about making this an optional dependency (much like crypto expressions, etc) as defined on master?
This would let anyone who wants this feature be able to use it, but would not require it for anyone who did not?
| match value.data_type() { | ||
| DataType::Int8 => { | ||
| let array = value.as_any().downcast_ref::<Int8Array>().unwrap(); | ||
| for i in 0..array.len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code doesn't seem to handle nulls (as in don't you have to check array.is_valid(i) prior to getting array.value()?
A test case for null values would probably be useful
Maybe you could use an iterator like
for value in array.iter() {
match value {
Some(v) => self.bitmap.insert(value as u32);
None => // do something with NULLs here
}
}|
FYI https://github.com/RoaringBitmap/roaring-rs/releases/tag/v0.9.0 was just released (with big performance optimizations) |
@Dandandan Thanks for your info, maybe i will redo the benchmark. |
| @@ -0,0 +1,229 @@ | |||
| // Licensed to the Apache Software Foundation (ASF) under one | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[brainstorming] how about we combine this to ApproxDistinct and use BitmapDistinctCountAccumulator for int8, int16 and int32 if the feature is avilable?
And use NumericHLLAccumulator for int64 and other non-int types. This way, user just need declare approx_distinct and rely on Datafusion to auto select the best approximate algorithm
unrelate notes: as a user, I do want to keep count(distinct) as exact count and approx_distinct as approximation
|
marking as draft (so it is easer to see what PRs are waiting for review) |
|
Closing this PR since it has not been updated in a long time. Feel free to re-open if this is still being worked on. @Ted-Jiang |
Which issue does this PR close?
Closes #1823.
Rationale for this change
test result:
So, choose croaring-rs bitmap for more efficient.
What changes are included in this PR?
Add new
aggregate_functioncalled bitmap_distinctAre there any user-facing changes?
Maybe we can add one optimize rule to like
approx_median(), changecount(distinct x)tobitmap_distinctin future