add hyperloglog implementation (`add` and `count`) #1095

jimexist · 2021-10-09T15:27:46Z

Which issue does this PR close?

Related to #1087; this PR serves as a precursor to that pull request.
Closes #.

Rationale for this change

I'm thinking of adopting Redis's hyperloglog implementation along with https://github.com/crepererum/pdatastructs.rs/blob/master/src/hyperloglog.rs for its code structure.

I don't necessarily want to use @crepererum's pdatastructs as is because:

- this way the code size of datafusion is well controlled, and since we can make reasonably strong assumptions about the usage pattern, we can skip some of the generics and adopt Redis's approach (e.g. precision = 14 with 16384 registers). Unlike Redis, which packs 6-bit registers into a raw byte array in C to save space at the cost of rather tricky bit shifting, each register here is a u8, so instead of a max size of 12KiB (16384 × 6 bits) we'll have 16KiB (16384 × 8 bits) per HLL
- we also need either bincode or serde support so that merge can work across record batches
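To make the layout described above concrete, here is a minimal sketch of a HyperLogLog with precision 14 and one `u8` per register. It is not the code from this PR: the type name `HllSketch` is hypothetical, the standard library's `DefaultHasher` stands in for aHash, and `count` uses the classic estimator with a linear-counting correction rather than Redis's estimator. `merge` is included only to show why mergeable state matters across record batches.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Precision p = 14, giving 2^14 = 16384 registers.
const PRECISION: u32 = 14;
const NUM_REGISTERS: usize = 1 << PRECISION;

/// Hypothetical sketch type; one u8 per register (16 KiB in total).
pub struct HllSketch {
    registers: Vec<u8>,
}

impl HllSketch {
    pub fn new() -> Self {
        Self { registers: vec![0; NUM_REGISTERS] }
    }

    /// Adds an element: the top 14 hash bits pick a register, the
    /// remaining 50 bits determine the rank stored in it.
    pub fn add<T: Hash>(&mut self, value: &T) {
        // Stand-in hasher with fixed keys (assumption; the PR uses aHash).
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let hash = hasher.finish();

        let index = (hash >> (64 - PRECISION)) as usize;
        // Shift the index bits out; rank = leading zeros of the rest + 1,
        // capped because only 50 bits carry information.
        let rest = hash << PRECISION;
        let rank = (rest.leading_zeros().min(64 - PRECISION) + 1) as u8;
        if rank > self.registers[index] {
            self.registers[index] = rank;
        }
    }

    /// Classic estimator with linear counting for small cardinalities.
    pub fn count(&self) -> f64 {
        let m = NUM_REGISTERS as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self
            .registers
            .iter()
            .map(|&r| 2f64.powi(-i32::from(r)))
            .sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            // Many empty registers: linear counting is more accurate.
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }

    /// Merges another sketch by taking the register-wise maximum.
    pub fn merge(&mut self, other: &HllSketch) {
        for (a, b) in self.registers.iter_mut().zip(&other.registers) {
            *a = (*a).max(*b);
        }
    }
}
```

Adding the same values twice leaves the registers unchanged, and merging two sketches built with the same hash function approximates the cardinality of the union. Both properties depend on every sketch hashing identically.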

What changes are included in this PR?

Are there any user-facing changes?

datafusion/src/physical_plan/hyperloglog/mod.rs

Dandandan · 2021-10-09T17:19:15Z

datafusion/src/physical_plan/hyperloglog/mod.rs

+
+    /// Adds an element to the HyperLogLog.
+    pub fn add(&mut self, obj: &T) {
+        let mut hasher = self.buildhasher.build_hasher();


Since we only use aHash here, we could call the aHash methods directly and avoid the generic hash builder API.

datafusion/src/physical_plan/hyperloglog/mod.rs

houqp

nice work!

Dandandan · 2021-10-11T10:47:26Z

datafusion/src/physical_plan/hyperloglog/mod.rs

+    /// reasonable performance.
+    #[inline]
+    fn hash_value(&self, obj: &T) -> u64 {
+        let mut hasher = AHasher::default();


I think the builder or `RandomState` has to be reused across calls; otherwise it will generate different hashes across calls for the same value.

jimexist · 2021-10-11

And wouldn't that be okay and also contribute to probabilistic stability?

jimexist · 2021-10-11

On second thought, that is a good point, so let me change this.

crepererum · 2021-10-11

You have to use the same hash function for the same session (or for all sessions). If not, your hash function is more or less a random number generator (in the worst case), and depending on how many hash functions you generate, you will overestimate the cardinality, because the same input now produces different hashes and looks like different inputs.

What you do here is half-correct: `AHasher::default` uses either a compile-time seed or a seed drawn once per process (using global state). This behavior is OK as long as the query is local and you don't serialize the HLL. However, if you have multiple processes (e.g. ballista) or someone is using serde to dump and restore the HLL, this breaks because now you have a totally different hash function.

To future-proof this whole thing and to make DoS attacks less likely, I would advise you to make the hasher a runtime parameter of the HLL, generate it once during query planning, and serialize the hasher alongside the query plan and the intermediate HLLs.
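The seeding issue can be illustrated with the standard library alone. The sketch below uses `std::collections::hash_map::RandomState` as a stand-in for aHash's state (an assumption for illustration; the helper names are hypothetical). It compares hashing the same value twice with one reused state against hashing it with two freshly created states.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

/// Hashes `value` with a hasher built from the given state.
fn hash_with<S: BuildHasher>(state: &S, value: &str) -> u64 {
    let mut hasher = state.build_hasher();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Returns (hashes agree when the state is reused,
///          hashes agree when each call gets a fresh state).
fn compare_states(value: &str) -> (bool, bool) {
    let shared = RandomState::new();
    let reused = hash_with(&shared, value) == hash_with(&shared, value);
    // Each RandomState::new() carries different keys, so the same
    // input is very unlikely to hash to the same value twice.
    let fresh =
        hash_with(&RandomState::new(), value) == hash_with(&RandomState::new(), value);
    (reused, fresh)
}
```

With a reused state the two hashes always agree. With fresh states they almost certainly differ, so an HLL fed this way would treat repeated inputs as distinct values.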

Dandandan · 2021-10-11

For now maybe a fixed `RandomState` would be OK-ish? In the join / aggregate algorithms we are also doing this. @houqp do you have an opinion on this? I guess usages like ROAPI have a higher chance of being vulnerable to DoS attacks by exposing an API to the end user. We should do something like what @crepererum suggests to make it work for e.g. ballista.
There are of course more parts of the query engine you could abuse for DoS attacks, like generating cross joins, having a very large number of columns, etc., so maybe it makes more sense to spend some time on being able to e.g. specify and handle timeouts for query execution.

jimexist · 2021-10-11

These are all good points; thanks for the discussion. I've decided to use a static random state for the moment and revisit the decision once cross-session serialization is in scope.
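As a rough illustration of what a static state means, the sketch below builds every hasher from the same fixed state using only the standard library. `FixedState` and `hash_value` are hypothetical names, and `BuildHasherDefault<DefaultHasher>` stands in for a fixed aHash state (which would presumably be built with something like aHash's `RandomState::with_seeds`).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{BuildHasher, BuildHasherDefault, Hash, Hasher};

/// A state with no per-instance randomness: every hasher it builds
/// starts from the same fixed keys.
type FixedState = BuildHasherDefault<DefaultHasher>;

/// Hashes a value with the fixed state, so equal inputs always
/// produce equal hashes within the same build of the program.
fn hash_value<T: Hash>(value: &T) -> u64 {
    let state = FixedState::default();
    let mut hasher = state.build_hasher();
    value.hash(&mut hasher);
    hasher.finish()
}
```

The standard library does not promise that `DefaultHasher`'s algorithm stays the same across Rust releases. A fixed state like this is therefore only safe within one build, which matches the decision to defer cross-session serialization.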

alamb

This is some cool work @jimexist

chitralverma · 2023-04-04T12:32:45Z

@jimexist how do you think the current implementation compares with this crate?

github-actions bot added the datafusion label Oct 9, 2021

Dandandan reviewed Oct 9, 2021

datafusion/src/physical_plan/hyperloglog/mod.rs (outdated)

Dandandan reviewed Oct 9, 2021

jimexist force-pushed the add-hyperloglog-impl branch from afae97a to 0abcd6e Compare October 10, 2021 08:05

jimexist changed the title ~~[WIP] add hyperloglog implementation~~ add hyperloglog implementation (part I, add and count) Oct 10, 2021

jimexist marked this pull request as ready for review October 10, 2021 08:06

jimexist force-pushed the add-hyperloglog-impl branch 7 times, most recently from 4d395ea to 51d9df1 Compare October 10, 2021 12:54

jimexist requested a review from Dandandan October 10, 2021 12:54

jimexist force-pushed the add-hyperloglog-impl branch from 51d9df1 to c8186f4 Compare October 10, 2021 15:33

houqp reviewed Oct 10, 2021

datafusion/src/physical_plan/hyperloglog/mod.rs

houqp approved these changes Oct 10, 2021

jimexist requested a review from alamb October 11, 2021 03:32

jimexist changed the title ~~add hyperloglog implementation (part I, add and count)~~ add hyperloglog implementation (add and count) Oct 11, 2021

Dandandan reviewed Oct 11, 2021

jimexist mentioned this pull request Oct 11, 2021

implement approx_distinct function using HyperLogLog #1087

Merged

jimexist force-pushed the add-hyperloglog-impl branch from c8186f4 to ff81436 Compare October 11, 2021 12:36

jimexist added 2 commits October 11, 2021 20:56

add hyperloglog implementation

0651ba2

adding string type

48fe5c8

jimexist force-pushed the add-hyperloglog-impl branch from ff81436 to 48fe5c8 Compare October 11, 2021 12:56

Dandandan approved these changes Oct 11, 2021

jimexist merged commit 246fd61 into apache:master Oct 11, 2021

jimexist deleted the add-hyperloglog-impl branch October 11, 2021 16:48

alamb reviewed Oct 11, 2021

houqp added the enhancement New feature or request label Oct 13, 2021

chitralverma mentioned this pull request Apr 3, 2023

feat(python, rust): Add approx distinct count via approx_unique() pola-rs/polars#7937

Merged
