Skip to content

Conversation

@Ted-Jiang
Copy link
Member

Which issue does this PR close?

Closes #1823.

Rationale for this change

test result:

1million_1million.parquet

+----------------------------+
| COUNT(DISTINCT test.value) |
+----------------------------+
| 631504                     |
+----------------------------+
1 row in set. Query took 1.225 seconds.


+---------------------------------+
| BITMAPCOUNTDISTINCT(test.value) |
+---------------------------------+
| 631504                          |
+---------------------------------+·
1 row in set. Query took 0.175 seconds(roaring-rs).


+---------------------------------+
| BITMAPCOUNTDISTINCT(test.value) |
+---------------------------------+
| 631504                          |
+---------------------------------+
1 row in set. Query took 0.052 seconds (croaring-rs).


+----------------------------+
| APPROXDISTINCT(test.value) |
+----------------------------+
| 630261                     |
+----------------------------+
1 row in set. Query took 0.052 seconds.

So, choose croaring-rs bitmap for more efficient.

What changes are included in this PR?

Add new aggregate_function called bitmap_distinct

Are there any user-facing changes?

Maybe we can add one optimize rule to like approx_median(), change count(distinct x) to bitmap_distinct in future

Copy link
Contributor

@e-dard e-dard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty cool @Ted-Jiang! Are you planning on using this functionality in a project using Datafusion?

Some thoughts:

  • Nice speedup on these distinct counts for integers.
  • I wonder if/how this gets things closer to being able to do distinct on compressed data (in DF's case on dictionary encoded columns). The problem (as I understand it) is that there is no guarantee that Arrow dictionaries have the same encoded representation for a value across batches, or even in the same record batch (if I remember how dictionary concatenation currently works in Arrow).
  • Would this work on 64-bit columns if they could first be casted to 32-bit? That is, assuming the contents of the 64-bit column actually fit as 32-bit unsigned integers?

impl Accumulator for BitmapDistinctCountAccumulator {
//state() can be used by physical nodes to aggregate states together and send them over the network/threads, to combine values.
fn state(&self) -> Result<Vec<ScalarValue>> {
//maybe run optimized
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bit of a tricky question because the bitmap only lives for the duration of the query right? One of the main things that can be optimised is to convert array containers into run containers (which use RLE compression), so it seems likely that it depends on the contents of the input. A benchmark with typical data might help answer this.

Copy link
Member Author

@Ted-Jiang Ted-Jiang Feb 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, when meeting large amount data, run optimized may extremely reduce shuffle data between thread or process(ballista), it may reduce IO cost.

Copy link
Member Author

@Ted-Jiang Ted-Jiang Feb 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check the map size then decide whether run optimized

Ok(())
}

fn evaluate(&self) -> Result<ScalarValue> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this generally only gets called once per query? Only asking because cardinality is quite an expensive call.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry , is there something i miss ? Are there more efficient API, to get the number of integers contained in the bitmap

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this is the only way. I was just checking it is only called once for the final result.

@Ted-Jiang
Copy link
Member Author

Ted-Jiang commented Feb 17, 2022

  • I wonder if/how this gets things closer to being able to do distinct on compressed data (in DF's case on dictionary encoded columns). The problem (as I understand it) is that there is no guarantee that Arrow dictionaries have the same encoded representation for a value across batches, or even in the same record batch (if I remember how dictionary concatenation currently works in Arrow).

There is no guarantee that Arrow dictionaries have the same encoded representation for a value across batches : yes
For no-int col: We plan to maintain a global dictionary to encode col(string) into 32-bit int to accelerate count distinct.

  • Would this work on 64-bit columns if they could first be casted to 32-bit? That is, assuming the contents of the 64-bit column actually fit as 32-bit unsigned integers?

IMO, it will lose front 32 bit info, the result will be incorrect.

@Ted-Jiang
Copy link
Member Author

@alamb i found in CI

 --- stderr
  thread 'main' panicked at 'Unable to find libclang: "couldn't find any valid shared libraries matching: ['libclang.so', 'libclang-*.so', 'libclang.so.*', 'libclang-*.so.*'], set the `LIBCLANG_PATH` environment variable to a path where one of these files can be found (invalid: [])"', /github/home/.cargo/registry/src/github.com-1ecc6299db9ec823/bindgen-0.59.2/src/lib.rs:2144:31

I think we should already depend on clang,
Do you know how to add this env?
IMO, like croaring only need install llvm in windows env

@Ted-Jiang Ted-Jiang force-pushed the arrow-bitmap branch 6 times, most recently from 78f8b36 to 8b23db1 Compare February 17, 2022 12:53
pyo3 = { version = "0.15", optional = true }
tempfile = "3"
parking_lot = "0.12"
croaring = "0.5.1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am somewhat worried about adding this dependency to datafusion (at least as a required dependency). Not only will it increase the requirements / compile time for datafusion, it also complicates the build process (as now one needs to install clang to build datafusion)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb Thanks for your reply. I agree It 's complicated for add a new dependency clang into datafusion.
IMO, we still need an high-efficiency bitmap implement roarling-bitmap, like IOx and other high-end OLAP systems to do some optimize like late materialization. Maybe we should change croaring to roarling-rs (completely written in rust, won't cost too mush compile time, i also find datafusion compile\build parallel ), maybe one day it will catch up with croaring 😂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @alamb , would be good if we can either make this dependency optional or change to a pure rust implementation.

rustup toolchain install ${{ matrix.rust }}
rustup default ${{ matrix.rust }}
rustup component add rustfmt
- name: Set up Clang
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in general apache projects try not to use "third-party" github actions (aka actions not hosted by github or apache itself) so I am worried about using this action

@Ted-Jiang Ted-Jiang force-pushed the arrow-bitmap branch 2 times, most recently from 27c468a to dcb8601 Compare February 21, 2022 02:55
@Ted-Jiang
Copy link
Member Author

@houqp @alamb change to the pure rust implementation bitmap. plz review😊

Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@houqp houqp added the enhancement New feature or request label Feb 21, 2022
@Ted-Jiang Ted-Jiang changed the title Implement bitmap_distinct function using croaring-rs bitmap Implement bitmap_distinct function using roaring bitmap Feb 21, 2022
}
}

pub(crate) fn is_bitmap_count_distinct_supported_arg_type(arg_type: &DataType) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to support U64, I64 or other numeric type?
@Ted-Jiang

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liukun4515 AFAIK, roarling-rs only support insert U32, I32. If we cast u64 to u32 may cause incorrect result.

Copy link
Contributor

@liukun4515 liukun4515 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@houqp houqp requested a review from alamb February 22, 2022 02:12
@Ted-Jiang Ted-Jiang force-pushed the arrow-bitmap branch 2 times, most recently from ab9907d to ea33bc8 Compare February 28, 2022 04:01
@Ted-Jiang
Copy link
Member Author

@jimexist @houqp Have fixed the conflict. If there are no more issues here, could you please merge this to avoid conflict again😊.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @Ted-Jiang

I have a few concerns:

  1. The new dependency on roaring
  2. The handling of null values
  3. This may make the user experience less great;

What I mean by the user experience less great is that if we added bitmap_distinct users will have to decide if they should call distinct or bitmap_distinct. This choice which seems non ideal because they now have to understand the tradeoffs (like limited type support)

Would it be possible for DataFusion to automatically chose to use the roaring bitmap implementation as part of the implementation of the normal distinct aggregate? In other words, automatically use roaring for the supported datatypes and fall back to a slower but more general implementation for unsupported datatypes?

paste = "^1.0"
ahash = { version = "0.7", default-features = false }
ordered-float = "2.10"
roaring = "0.8.1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think about making this an optional dependency (much like crypto expressions, etc) as defined on master?

This would let anyone who wants this feature be able to use it, but would not require it for anyone who did not?

https://github.com/apache/arrow-datafusion/blob/7eb3bd8/datafusion-physical-expr/Cargo.toml#L35-L39

match value.data_type() {
DataType::Int8 => {
let array = value.as_any().downcast_ref::<Int8Array>().unwrap();
for i in 0..array.len() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code doesn't seem to handle nulls (as in don't you have to check array.is_valid(i) prior to getting array.value()?

A test case for null values would probably be useful

Maybe you could use an iterator like

for value in array.iter() {
  match value {
    Some(v) => self.bitmap.insert(value as u32);
    None => // do something with NULLs here
  }
}

@Dandandan
Copy link
Contributor

@Ted-Jiang

FYI https://github.com/RoaringBitmap/roaring-rs/releases/tag/v0.9.0 was just released (with big performance optimizations)

@Ted-Jiang
Copy link
Member Author

@Ted-Jiang

FYI https://github.com/RoaringBitmap/roaring-rs/releases/tag/v0.9.0 was just released (with big performance optimizations)

@Dandandan Thanks for your info, maybe i will redo the benchmark.

@@ -0,0 +1,229 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[brainstorming] how about we combine this to ApproxDistinct and use BitmapDistinctCountAccumulator for int8, int16 and int32 if the feature is avilable?

And use NumericHLLAccumulator for int64 and other non-int types. This way, user just need declare approx_distinct and rely on Datafusion to auto select the best approximate algorithm

https://github.com/apache/arrow-datafusion/blob/81592947e8814327ebdbd1fbc3d4a090796e37a3/datafusion-physical-expr/src/expressions/approx_distinct.rs#L91-L98


unrelate notes: as a user, I do want to keep count(distinct) as exact count and approx_distinct as approximation

@alamb
Copy link
Contributor

alamb commented Apr 15, 2022

marking as draft (so it is easer to see what PRs are waiting for review)

@alamb alamb marked this pull request as draft April 15, 2022 14:50
@andygrove
Copy link
Member

Closing this PR since it has not been updated in a long time. Feel free to re-open if this is still being worked on. @Ted-Jiang

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

implement bitmap_distinct function using bitmap

9 participants