Add ByteBuffer hashing methods to MurmurHash3, BaseHllSketch. #353
leerho merged 1 commit into apache:Gian-MurmurHash3 from
Conversation
Hmm, I'm not sure what the test coverage error means. It doesn't sound like it is related to this patch.
I will take a look :)
Gian
@leerho Could you please create the branch first? The GitHub UI only lets me select pre-existing branches, and I don't have permissions to create a new branch.
Sorry, I was about to do that earlier, but I had contractors in the house and I got pulled away. It should be there now. Lee.
I didn't realize I could do it myself. Thanks anyway.
@gianm
I'm not sure you saw this as I posted it on the PR after it was closed. My
bad.
Now that I've had a look at this, I have some questions and possible
suggestions to make it even faster.
1. It would be helpful to me if you could give me an idea of the range
of sizes of these ByteBuffer objects that you wish to create a unique hash
of. Even better would be a description of the distribution of sizes (e.g,
95%ile is 100 bytes, median is 10 bytes, etc.)
2. As I recall from some characterization studies I did several years
ago, a ByteBuffer.getLong() is much slower than a
LongBuffer.getLong() because
under the covers, for the ByteBuffer.getLong() call, the BB does a
brute-force byte conversion of the long, rather than calling
Unsafe.getLong(). And if your BB is of any reasonable size, the getLong
call will predominate. I understand why you chose this, but just keep in
mind what Java is doing under the covers, and I think we can do better.
3. Also, I'm a bit surprised that this switch statement would be faster
than just a tight loop. But perhaps you have recently characterized both
and I could be wrong.
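As a concrete illustration of the loop pattern discussed in points 2 and 3, here is a stdlib-only sketch (this is not the datasketches code; the class name BufferWalk and the toy mixing step are made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// A minimal sketch of the "tight loop" pattern under discussion: consume the
// buffer eight bytes at a time via ByteBuffer.getLong(), then fold in the
// remaining tail bytes. A real hash (MurmurHash3, xxHash) mixes far better
// than the toy (acc * 31) ^ value step used here.
public final class BufferWalk {
    static long fold(ByteBuffer buf) {
        // duplicate() so the caller's position/limit are untouched
        ByteBuffer bb = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        long acc = 0;
        while (bb.remaining() >= Long.BYTES) {
            acc = (acc * 31) ^ bb.getLong();  // the per-long call Lee is discussing
        }
        long tail = 0;
        for (int shift = 0; bb.hasRemaining(); shift += 8) {
            tail |= (bb.get() & 0xFFL) << shift;  // little-endian tail assembly
        }
        return (acc * 31) ^ tail;
    }

    public static void main(String[] args) {
        ByteBuffer a = ByteBuffer.wrap("sixteen-byte-str".getBytes());
        ByteBuffer b = ByteBuffer.wrap("sixteen-byte-st_".getBytes());
        System.out.println(fold(a) != fold(b));  // distinct inputs give distinct folds here
    }
}
```

The per-call overhead Lee mentions applies to each bb.getLong() above; the loop-vs-switch question is only about how those calls are arranged.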
WRT the MurmurHash3 modifications.
You should know that the MurmurHash3 is a great hash function, and we chose
it a long time ago and are stuck with it for binary compatibility reasons.
Nonetheless, there is a newer hash function, the XxHash that has comparable
avalanche and bit-independence to the MurmurHash3 and is about twice as
fast. We already have the XxHash integrated into the datasketches.hash package
as well as in the Memory component. And, in fact, it was integrated into
the Memory component for just this kind of case.
So consider this approach:
- Use Memory to wrap the ByteBuffer. Memory uses Unsafe to access the
bytes underneath the BB.
- Hash the BB bytes using XxHash. This is already a built-in function
and is very fast.
- Take the resulting 64-bit hash and feed it as a long into HLL. Yes,
you will be hashing twice, but if the size of your BB is more than 10 or so
longs, the speed you gain from the xxHash will make up for the double
hashing.
How much improvement you get will depend on your distribution of BB sizes,
which comes back to my question in #1
<#1>.
Keep in mind that Panama is now integrated into JDK14+, (JDK16 is the most
recent) and I am in process of migrating the whole Java library, first to
JDK9-13, and then highly leveraging Panama starting with JDK17, which is
the next LTS after JDK11.
With Panama the scheme that you would use would be identical except you
would use the new Panama MemorySegment instead of our Memory, but the steps
would be the same. So the switch over to Panama should be really easy.
Lee.
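The three-step approach above can be sketched with stdlib-only stand-ins (fastHash64 below is a placeholder FNV-1a where the real proposal uses Memory.xxHash64 from datasketches-memory, and a HashSet stands in for the HLL sketch; all names here are illustrative):

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Set;

// Sketch of the "hash once, then update the sketch with a long" flow:
// hash the ByteBuffer's bytes to a single 64-bit value with a fast hash,
// then feed that long into the sketch's existing update(long) path.
public final class DoubleHashFlow {
    // Placeholder 64-bit hash (FNV-1a) standing in for xxHash64; uses
    // absolute get(i) so the buffer's position is never disturbed.
    static long fastHash64(ByteBuffer buf) {
        long h = 0xcbf29ce484222325L;
        for (int i = buf.position(); i < buf.limit(); i++) {
            h = (h ^ (buf.get(i) & 0xFF)) * 0x100000001b3L;
        }
        return h;
    }

    public static void main(String[] args) {
        Set<Long> sketchStandIn = new HashSet<>();   // pretend HLL sketch
        for (String s : new String[] {"alice", "bob", "alice"}) {
            ByteBuffer bb = ByteBuffer.wrap(s.getBytes());
            sketchStandIn.add(fastHash64(bb));       // i.e., sketch.update(hash)
        }
        System.out.println(sketchStandIn.size());    // 2 distinct identifiers
    }
}
```

This is the "double hashing" trade: one extra 64-bit hash per row, in exchange for a faster bulk pass over the bytes.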
…On Fri, May 7, 2021 at 11:51 AM Gian Merlino ***@***.***> wrote:
Gian
Please modify this PR to merge into "Gian-MurmurHash3" instead of "master".
Thanks,
Lee.
@leerho — some thoughts on your questions:
The use case I had in mind is query-time approximate count distinct on string columns. So the distribution of sizes would vary based on the column. I would guess that most of the time, people want distinct counts of things like usernames or user ids (8–50 chars) or short indicator variables (1–8 chars). The case that inspired us to work on this patch involved 16-char strings.
Are you suggesting that we convert the ByteBuffer to a LongBuffer first, and then do a getLong on that?
I did try both; I wrote a JMH benchmark and ran it on an AWS m5.large and got the following results ("switch" is what is in this patch; "loop" is basically the same as the current datasketches-java implementation for byte arrays, a small tight loop):
hash_buffer loop 4 avgt 15 18.924 ± 0.031 ns/op
hash_buffer loop 12 avgt 15 20.665 ± 0.044 ns/op
hash_buffer loop 20 avgt 15 39.914 ± 0.074 ns/op
hash_buffer loop 40 avgt 15 37.688 ± 0.104 ns/op
hash_buffer loop 64 avgt 15 41.518 ± 0.097 ns/op
hash_buffer switch 4 avgt 15 14.681 ± 0.022 ns/op
hash_buffer switch 12 avgt 15 16.652 ± 0.028 ns/op
hash_buffer switch 20 avgt 15 35.020 ± 0.030 ns/op
hash_buffer switch 40 avgt 15 35.596 ± 0.052 ns/op
hash_buffer switch 64 avgt 15 39.433 ± 0.175 ns/op
Interesting idea — we'll keep that in mind. I suspect that in most cases, though, the string size will be less than 80 bytes.
@gianm
Are you suggesting that we convert the ByteBuffer to a LongBuffer first,
and then do a getLong on that?
No. That would only work if your buffer sizes were always a multiple of 8
bytes, and I doubt you can guarantee that. I was telling you this to let
you know that ByteBuffer.getLong() is quite slow on its own. Whether you
wrap it inside a loop or a switch doesn't fix the problem that the call
itself is slow.
I didn't realize your BBs were so small. Hmm. You must be doing lots of
ByteBuffer.slice() calls to get those. And BB.slice() is not known to be
very fast either, although I haven't actually characterized it.
Nonetheless, all those slices are creating objects that ultimately the GC
has to collect ... not good.
(I am speculating about what your situation is ... danger!)
I presume your "column" is a much larger BB sector that may have thousands
of these identifiers in it, prior to your creating slices based on offsets.
Consider this:
- Wrap the entire column into Memory.
- The xxHash method takes an offset and length ...
- Loop on long memory.xxHash64(offsetBytes, lengthBytes, seed) rather than creating slice objects.
This returns the xxHash immediately, which you can insert into the sketch, i.e., sketch.update(memory.xxHash64(offset, length, seed)).
This bypasses the creation of the BB.slice() objects, the slow BB.getLong() method, and the need for your fancy switch.
Lee.
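The slice-free pattern described above can be sketched with stdlib-only code (rangeHash64 below is a placeholder FNV-1a standing in for memory.xxHash64(offsetBytes, lengthBytes, seed); the class name and the packed column layout are made up for illustration):

```java
import java.nio.ByteBuffer;

// Keep one buffer for the whole "column" and hash each identifier by its
// absolute (offset, length), never materializing a BB.slice() per row.
public final class SliceFree {
    // Hash an absolute range; only absolute get(i) is used, so the buffer's
    // position/limit are never touched and no garbage is created per row.
    static long rangeHash64(ByteBuffer column, int offset, int length) {
        long h = 0xcbf29ce484222325L;
        for (int i = offset; i < offset + length; i++) {
            h = (h ^ (column.get(i) & 0xFF)) * 0x100000001b3L;
        }
        return h;
    }

    public static void main(String[] args) {
        // A toy column holding four packed identifiers: user, user1234, guest, user
        ByteBuffer column = ByteBuffer.wrap("useruser1234guestuser".getBytes());
        int[][] rows = { {0, 4}, {4, 8}, {12, 5}, {17, 4} };
        for (int[] r : rows) {
            long h = rangeHash64(column, r[0], r[1]);  // sketch.update(h) in real code
            System.out.println(h);
        }
    }
}
```

Equal byte ranges hash equally regardless of where they sit in the column, which is exactly what the per-row distinct count needs, and no per-row objects are allocated.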
…On Mon, May 10, 2021 at 10:34 AM Gian Merlino ***@***.***> wrote:
@leerho <https://github.com/leerho> — some thoughts on your questions:
- It would be helpful to me if you could give me an idea of the range
of sizes of these ByteBuffer objects that you wish to create a unique hash
of. Even better would be a description of the distribution of sizes (e.g,
95%ile is 100 bytes, median is 10 bytes, etc.)
The use case I had in mind is query-time approximate count distinct on
string columns. So the distribution of sizes would vary based on the
column. I would guess that most of the time, people want distinct counts of
things like usernames or user ids (8–50 chars) or short indicator variables
(1–8 chars). The case that inspired us to work on this patch involved
16-char strings.
- As I recall from some characterization studies I did several years
ago, a ByteBuffer.getLong() is much slower than, a LongBuffer.getLong()
because under the covers, for the ByteBuffer.getLong() call, the BB
does a brute-force byte conversion of the long, rather than calling
Unsafe.getLong(). And if your BB is of any reasonable size, the
getLong call will predominate. I understand why you chose this, but just
keep in mind what Java is doing under the covers, and I think we can do
better.
Are you suggesting that we convert the ByteBuffer to a LongBuffer first,
and then do a getLong on that?
- Also, I'm a bit surprised that this switch statement would be faster
than just a tight loop. But perhaps you have recently characterized both
and I could be wrong.
I did try both; I wrote a jmh benchmark and ran it on an AWS m5.large and
got the following results:
hash_buffer loop 4 avgt 15 18.924 ± 0.031 ns/op
hash_buffer loop 12 avgt 15 20.665 ± 0.044 ns/op
hash_buffer loop 20 avgt 15 39.914 ± 0.074 ns/op
hash_buffer loop 40 avgt 15 37.688 ± 0.104 ns/op
hash_buffer loop 64 avgt 15 41.518 ± 0.097 ns/op
hash_buffer switch 4 avgt 15 14.681 ± 0.022 ns/op
hash_buffer switch 12 avgt 15 16.652 ± 0.028 ns/op
hash_buffer switch 20 avgt 15 35.020 ± 0.030 ns/op
hash_buffer switch 40 avgt 15 35.596 ± 0.052 ns/op
hash_buffer switch 64 avgt 15 39.433 ± 0.175 ns/op
"switch" is what is in this patch. "loop" is something that is basically
the same as the current datasketches-java implementation for byte arrays (a
small tight loop).
So consider this approach:
- Use Memory to wrap the ByteBuffer. Memory uses Unsafe to access the
bytes underneath the BB.
- Hash the BB bytes using XxHash. This is already a built-in function
and is very fast.
- Take the resulting 64-bit hash and feed it as a long into HLL. Yes,
you will be hashing twice, but if the size of your BB is more than 10
or so
longs, the speed you gain from the xxHash will make up for the double
hashing.
Interesting idea — we'll keep that in mind. I suspect that in most cases,
though, the string size will be fewer than 80 bytes.
Yeah that makes sense. We'd like to do something like this eventually, for the reasons you describe, but it's a major effort and will take some time to find the bandwidth to do.
@gianm
By the way, I hope you realize that when you wrap your large BB segment
into Memory, it is just a view of the contents of the BB. No copying. And
everything that you are doing with that BB segment you can keep on
doing, treating it still as a BB. So you will have two views of the
contents of your BB segment: one from the BB and one from Memory. I was
hoping that you could create this Memory view just for this task you are
trying to speed up :)
Cheers,
Lee.
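The "two views, no copying" point can be seen with the stdlib alone: a second view over the same storage sees writes made through the first, which is the same sharing behavior a Memory wrap of a ByteBuffer gives you (Memory itself is not used in this illustration):

```java
import java.nio.ByteBuffer;

// Two views over one backing array: duplicate() shares the bytes but has
// independent position/limit, analogous to wrapping the BB in Memory.
public final class TwoViews {
    public static void main(String[] args) {
        ByteBuffer original = ByteBuffer.allocate(8);
        ByteBuffer view = original.duplicate();  // no copy of the contents
        original.put(0, (byte) 42);              // write through one view...
        System.out.println(view.get(0));         // ...is visible through the other: 42
    }
}
```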
My motivation for introducing the ByteBuffer methods is to allow Druid to pass mapped buffers directly to an HllSketch; see apache/druid#11172, apache/druid#11201. We're trying to eliminate unnecessary decodes and copies from the string-column-to-hll-sketch execution path.
Benchmarks in the second Druid PR, without this change, are clocking about 50 ns per row for very short strings (the benchmark mostly uses 2–4 character strings). We can speed up the per-row time on this benchmark by another 10 ns or so by applying this patch to datasketches-java, which lets us eliminate one more copy.