Skip to content

Conversation

@jimexist
Copy link
Member

@jimexist jimexist commented Oct 9, 2021

Which issue does this PR close?

Related #1087 and serves as a pre-cursor to that pull request
Closes #.

Rationale for this change

I'm thinking of adopting Redis's hyperloglog implementation along with https://github.com/crepererum/pdatastructs.rs/blob/master/src/hyperloglog.rs for its code structure.

I don't necessarily want to use @crepererum pdatastructs as is because:

  1. this way the code size of datafusion is well controlled and since we can make some reasonably good strong assumption on the usage pattern, we can skip some of the generics and adopt Redis's approach (e.g. precision = 14 with 16384 registers, but also each with u8 size instead of the raw byte array in C in order to save space, trading off with rather tricky bit shifting techniques - this way instead of max size of 12KiB we'll have 16KiB per HLL)
  2. also we need to have either bincode serde so that merge can work across record batches

What changes are included in this PR?

Are there any user-facing changes?


/// Adds an element to the HyperLogLog.
pub fn add(&mut self, obj: &T) {
let mut hasher = self.buildhasher.build_hasher();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we're using only aHash in this usage here, we could use the aHash methods and avoid the generic hash builder api.

@jimexist jimexist force-pushed the add-hyperloglog-impl branch from afae97a to 0abcd6e Compare October 10, 2021 08:05
@jimexist jimexist changed the title [WIP] add hyperloglog implementation add hyperloglog implementation (part I, add and count) Oct 10, 2021
@jimexist jimexist marked this pull request as ready for review October 10, 2021 08:06
@jimexist jimexist force-pushed the add-hyperloglog-impl branch 7 times, most recently from 4d395ea to 51d9df1 Compare October 10, 2021 12:54
@jimexist jimexist requested a review from Dandandan October 10, 2021 12:54
@jimexist jimexist force-pushed the add-hyperloglog-impl branch from 51d9df1 to c8186f4 Compare October 10, 2021 15:33
Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work!

@jimexist jimexist requested a review from alamb October 11, 2021 03:32
@jimexist jimexist changed the title add hyperloglog implementation (part I, add and count) add hyperloglog implementation (add and count) Oct 11, 2021
/// reasonable performance.
#[inline]
fn hash_value(&self, obj: &T) -> u64 {
let mut hasher = AHasher::default();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the builder or randomstate has to be reused across calls, otherwise it will generate different hashes across calls for the same value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And wouldn't that be okay and also contribute to probabilistic stability?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the second thought that was a good point so let me change this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to use the same hash function for the same session (or for all sessions). If not, then your hash function is more or less a random number generator (in the worst case) and you're (depending on how many hash functions you generate) overestimate cardinality (because the same input now creates different hashes and looks like different inputs).

What you do here is half-correct: AHasher::default uses either a compile-time seed or a process-time seed (once drawn using global state). This behavior is OK as long as the query is local and you don't serialize the HLL. However, if you have multiple processes (e.g. ballista) or someone is using serde to dump and restore the HLL, this breaks because now you have a totally different hash function.

To future-proof this whole thing and to make DoS attacks less likely, I would advice you to make the hasher a runtime parameter of the HLL, generate it once during query planning and serialize the hasher alongside the query plan and the intermediate HLLs.

Copy link
Contributor

@Dandandan Dandandan Oct 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now maybe a fixed RandomState would be OK-ish? In the join / aggregate algorithms we are also doing this. @houqp do you have an opinion on this? I guess usages like ROAPI have a higher chance of being vulnerable to DoS attacks by exposing an API to the end user. We should do something like what @crepererum suggests to make it work for e.g. ballista.
There are of course more parts of the query engine you could abuse for DoS attacks, like generating cross joins, having a very large number of columns, etc. so maybe it makes more sense to spend some time in being able to e.g. specify and handle timeouts for query execution.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All are good points thanks for the discussion. I've decided to use a static random state for the moment and revisit the decision once cross session serialization is in scope.

@jimexist jimexist force-pushed the add-hyperloglog-impl branch from ff81436 to 48fe5c8 Compare October 11, 2021 12:56
@jimexist jimexist merged commit 246fd61 into apache:master Oct 11, 2021
@jimexist jimexist deleted the add-hyperloglog-impl branch October 11, 2021 16:48
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is some cool work @jimexist

@houqp houqp added the enhancement New feature or request label Oct 13, 2021
@chitralverma
Copy link

chitralverma commented Apr 4, 2023

@jimexist how do you think the current implementation compares with this crate?

andygrove pushed a commit to andygrove/datafusion that referenced this pull request Jan 31, 2025
## Which issue does this PR close?

Closes apache/datafusion-comet#1067

## Rationale for this change

Bug fix. A few expressions were failing some unsigned type related tests

## What changes are included in this PR?

 - For `u8`/`u16`, switched to use `generate_cast_to_signed!` in order to copy full i16/i32 width instead of padding zeros in the higher bits
 - `u64` becomes `Decimal(20, 0)` but there was a bug in `round()`  (`>` vs `>=`)

## How are these changes tested?

Put back tests for unsigned types
unkloud pushed a commit to unkloud/datafusion that referenced this pull request Mar 23, 2025
## Which issue does this PR close?

Closes apache/datafusion-comet#1067

## Rationale for this change

Bug fix. A few expressions were failing some unsigned type related tests

## What changes are included in this PR?

 - For `u8`/`u16`, switched to use `generate_cast_to_signed!` in order to copy full i16/i32 width instead of padding zeros in the higher bits
 - `u64` becomes `Decimal(20, 0)` but there was a bug in `round()`  (`>` vs `>=`)

## How are these changes tested?

Put back tests for unsigned types
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants