add hyperloglog implementation (add and count)
#1095
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/// Adds an element to the HyperLogLog.
pub fn add(&mut self, obj: &T) {
    let mut hasher = self.buildhasher.build_hasher();
Since we're only using aHash here, we could call the aHash methods directly and avoid the generic hash builder API.
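A rough sketch of what that suggestion could look like (the helper below is hypothetical, not code from this PR): construct an `AHasher` directly instead of going through a generic `BuildHasher` field.

```rust
use ahash::AHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper illustrating the suggestion above: build an AHasher
/// directly rather than via the generic hash builder API.
fn hash_with_ahash<T: Hash>(obj: &T) -> u64 {
    // AHasher::default() is seeded per compile/process; see the discussion
    // further down in this thread about hash stability.
    let mut hasher = AHasher::default();
    obj.hash(&mut hasher);
    hasher.finish()
}
```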
houqp left a comment:
nice work!
/// reasonable performance.
#[inline]
fn hash_value(&self, obj: &T) -> u64 {
    let mut hasher = AHasher::default();
I think the builder or RandomState has to be reused across calls; otherwise it will generate different hashes for the same value from one call to the next.
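A minimal sketch of that suggestion, with assumed struct and field names (not the PR's actual code): create the build hasher once, store it, and reuse it for every hash, so repeated adds of the same value produce the same 64-bit hash.

```rust
use ahash::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

/// Hypothetical sketch: the RandomState is created once (e.g. in `new()`)
/// and reused across calls, so hashes are stable for the struct's lifetime.
struct HllSketch {
    state: RandomState,
}

impl HllSketch {
    fn hash_value<T: Hash>(&self, obj: &T) -> u64 {
        let mut hasher = self.state.build_hasher();
        obj.hash(&mut hasher);
        hasher.finish()
    }
}
```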
And wouldn't that be okay and also contribute to probabilistic stability?
On second thought, that's a good point, so let me change this.
You have to use the same hash function for the same session (or for all sessions). If not, your hash function is more or less a random number generator (in the worst case), and (depending on how many hash functions you end up generating) you'll overestimate cardinality, because the same input now produces different hashes and looks like different inputs.
What you do here is half-correct: AHasher::default uses either a compile-time seed or a process-time seed (drawn once using global state). This behavior is OK as long as the query is local and you don't serialize the HLL. However, if you have multiple processes (e.g. Ballista) or someone uses serde to dump and restore the HLL, this breaks because you now have a totally different hash function.
To future-proof this whole thing and to make DoS attacks less likely, I would advise making the hasher a runtime parameter of the HLL, generating it once during query planning, and serializing the hasher alongside the query plan and the intermediate HLLs.
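A hedged sketch of that proposal, assuming ahash's `RandomState::with_seeds` (the struct and seed handling below are illustrative, not from this PR): the seeds are drawn once at planning time, and because they are plain integers they can be serialized with the plan and the intermediate HLLs, so every process rebuilds the identical hash function.

```rust
use ahash::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

/// Hypothetical seed bundle chosen once during query planning and shipped
/// alongside the plan (e.g. to executors) and the partial HLLs.
#[derive(Clone, Copy)]
struct HllSeeds {
    k0: u64,
    k1: u64,
    k2: u64,
    k3: u64,
}

fn hash_with_seeds<T: Hash>(seeds: HllSeeds, obj: &T) -> u64 {
    // Rebuilding the RandomState from the same seeds yields the same hash
    // function on every node; in practice it would be built once and cached.
    let state = RandomState::with_seeds(seeds.k0, seeds.k1, seeds.k2, seeds.k3);
    let mut hasher = state.build_hasher();
    obj.hash(&mut hasher);
    hasher.finish()
}
```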
For now maybe a fixed RandomState would be OK-ish? We are already doing this in the join / aggregate algorithms. @houqp do you have an opinion on this? I guess usages like ROAPI, which expose an API to the end user, have a higher chance of being vulnerable to DoS attacks. We should do something like what @crepererum suggests to make it work for e.g. Ballista.
There are of course more parts of the query engine you could abuse for DoS attacks, like generating cross joins, having a very large number of columns, etc., so maybe it makes more sense to spend some time on being able to e.g. specify and handle timeouts for query execution.
These are all good points, thanks for the discussion. I've decided to use a static random state for the moment and revisit the decision once cross-session serialization is in scope.
alamb left a comment:
This is some cool work @jimexist
@jimexist how do you think the current implementation compares with this crate?
Which issue does this PR close?
Related to #1087 and serves as a precursor to that pull request.
Closes #.
Rationale for this change
I'm thinking of adopting Redis's hyperloglog implementation along with https://github.com/crepererum/pdatastructs.rs/blob/master/src/hyperloglog.rs for its code structure.
I don't necessarily want to use @crepererum's pdatastructs as is, because:
- registers use u8 size instead of the raw byte array in C (which is packed in order to save space, trading off with rather tricky bit shifting techniques; this way, instead of a max size of 12KiB, we'll have 16KiB per HLL) - see the size sketch below
- merge can work across record batches
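As a back-of-the-envelope illustration of the size tradeoff above (the constant names are my own, not from this PR), with the usual Redis precision of p = 14 there are 2^14 = 16384 registers:

```rust
// Illustrative only: compare one-u8-per-register with Redis's packed layout.
const PRECISION: u32 = 14; // assumed Redis-style precision
const NUM_REGISTERS: usize = 1 << PRECISION; // 16_384 registers

fn main() {
    // One u8 per register: simple indexing, no bit shifting, 16 KiB.
    let plain_u8_bytes = NUM_REGISTERS;
    // Redis packs 6 bits per register to save space: 12 KiB, at the cost
    // of the tricky bit shifting mentioned above.
    let packed_6bit_bytes = NUM_REGISTERS * 6 / 8;
    println!("plain u8 layout:  {} KiB", plain_u8_bytes / 1024); // 16
    println!("packed 6-bit (C): {} KiB", packed_6bit_bytes / 1024); // 12
}
```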
What changes are included in this PR?
Are there any user-facing changes?