
Add ByteBuffer hashing methods to MurmurHash3, BaseHllSketch.#353

Merged
leerho merged 1 commit into apache:Gian-MurmurHash3 from gianm:bytebuffer-hashing on May 7, 2021

Conversation

@gianm
Contributor

@gianm gianm commented May 5, 2021

My motivation for introducing the ByteBuffer methods is to allow Druid to pass mapped buffers directly to an HllSketch; see apache/druid#11172, apache/druid#11201. We're trying to eliminate unnecessary decodes and copies from the string-column-to-hll-sketch execution path.

Benchmarks in the second Druid PR, without this change, clock about 50 ns per row for very short strings (the benchmark mostly uses 2–4 character strings). Applying this datasketches-java patch speeds up the per-row time by another 10 ns or so, since it lets us eliminate one more copy.

@gianm
Contributor Author

gianm commented May 6, 2021

Hmm, I'm not sure what the test coverage error means. It doesn't sound like it is related to this patch.

@leerho
Member

leerho commented May 7, 2021 via email

@leerho
Member

leerho commented May 7, 2021

Gian
Please modify this PR to merge into "Gian-MurmurHash3" instead of "master".
Thanks,
Lee.

@gianm
Contributor Author

gianm commented May 7, 2021

Gian
Please modify this PR to merge into "Gian-MurmurHash3" instead of "master".
Thanks,
Lee.

@leerho Could you please create the branch first? The GitHub UI only lets me select pre-existing branches, and I don't have permissions to create a new branch.

@leerho
Member

leerho commented May 7, 2021

Sorry, I was about to do that earlier, but I had contractors in the house and I got pulled away. It should be there now.

Lee.

@leerho leerho changed the base branch from master to Gian-MurmurHash3 May 7, 2021 21:14
@leerho leerho merged commit f950acc into apache:Gian-MurmurHash3 May 7, 2021
@leerho
Member

leerho commented May 7, 2021

I didn't realize I could do it myself. Thanks anyway.

@leerho
Member

leerho commented May 7, 2021

@gianm
Now that I've had a look at this, I have some questions and possible suggestions to make it even faster.

  1. It would be helpful to me if you could give me an idea of the range of sizes of these ByteBuffer objects that you wish to create a unique hash of. Even better would be a description of the distribution of sizes (e.g., 95th percentile is 100 bytes, median is 10 bytes, etc.).
  2. As I recall from some characterization studies I did several years ago, a ByteBuffer.getLong() is much slower than a LongBuffer.getLong(), because under the covers, for the ByteBuffer.getLong() call, the BB does a brute-force byte conversion of the long rather than calling Unsafe.getLong(). And if your BB is of any reasonable size, the getLong call will predominate. I understand why you chose this, but just keep in mind what Java is doing under the covers, and I think we can do better.
  3. Also, I'm a bit surprised that this switch statement would be faster than just a tight loop. But perhaps you have recently characterized both and I could be wrong.

WRT the MurmurHash3 modifications.
You should know that the MurmurHash3 is a great hash function, and we chose it a long time ago and are stuck with it for binary compatibility reasons. Nonetheless, there is a newer hash function, the XxHash that has comparable avalanche and bit-independence to the MurmurHash3 and is about twice as fast. We already have the XxHash integrated into the datasketches.hash package as well as in the Memory component. And, in fact, it was integrated into the Memory component for just this kind of case.

So consider this approach:

  • Use Memory to wrap the ByteBuffer. Memory uses Unsafe to access the bytes underneath the BB.
  • Hash the BB bytes using XxHash. This is already a built-in function and is very fast.
  • Take the resulting 64-bit hash and feed it as a long into HLL. Yes, you will be hashing twice, but if the size of your BB is more than 10 or so longs, the speed you gain from the xxHash will make up for the double hashing.
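The three steps above can be sketched roughly as follows. This is a hypothetical fragment, assuming datasketches-java and datasketches-memory of that era are on the classpath; the class name, lgConfigK, and seed are illustrative choices, not part of the thread:

```java
import java.nio.ByteBuffer;

import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.memory.Memory;

public class XxHashThenHll {

  public static void main(String[] args) {
    HllSketch sketch = new HllSketch(12); // lgConfigK = 12, chosen arbitrarily
    ByteBuffer bb = ByteBuffer.wrap("example-value".getBytes());

    // Step 1: wrap the ByteBuffer; Memory accesses the backing bytes via Unsafe.
    Memory mem = Memory.wrap(bb);

    // Step 2: xxHash64 over the whole buffer (seed 0 is illustrative).
    long hash = mem.xxHash64(0, mem.getCapacity(), 0L);

    // Step 3: feed the 64-bit result into the sketch as a long.
    // The value is hashed twice (xxHash here, MurmurHash3 inside the sketch),
    // but for larger buffers the faster xxHash pass can still win overall.
    sketch.update(hash);
  }
}
```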

How much improvement you get will depend on your distribution of BB sizes, which comes back to my question in #1.

Keep in mind that Panama is now integrated into JDK 14+ (JDK 16 is the most recent), and I am in the process of migrating the whole Java library, first to JDK 9–13, and then heavily leveraging Panama starting with JDK 17, which is the next LTS after JDK 11.

With Panama the scheme that you would use would be identical except you would use the new Panama MemorySegment instead of our Memory, but the steps would be the same. So the switch over to Panama should be really easy.

Lee.

@leerho
Member

leerho commented May 10, 2021 via email

@gianm gianm deleted the bytebuffer-hashing branch May 10, 2021 17:17
@gianm
Contributor Author

gianm commented May 10, 2021

@leerho — some thoughts on your questions:

  • It would be helpful to me if you could give me an idea of the range of sizes of these ByteBuffer objects that you wish to create a unique hash of. Even better would be a description of the distribution of sizes (e.g., 95th percentile is 100 bytes, median is 10 bytes, etc.).

The use case I had in mind is query-time approximate count distinct on string columns. So the distribution of sizes would vary based on the column. I would guess that most of the time, people want distinct counts of things like usernames or user ids (8–50 chars) or short indicator variables (1–8 chars). The case that inspired us to work on this patch involved 16-char strings.

  • As I recall from some characterization studies I did several years ago, a ByteBuffer.getLong() is much slower than a LongBuffer.getLong(), because under the covers, for the ByteBuffer.getLong() call, the BB does a brute-force byte conversion of the long rather than calling Unsafe.getLong(). And if your BB is of any reasonable size, the getLong call will predominate. I understand why you chose this, but just keep in mind what Java is doing under the covers, and I think we can do better.

Are you suggesting that we convert the ByteBuffer to a LongBuffer first, and then do a getLong on that?
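For reference, a LongBuffer view reads the same underlying bytes as the ByteBuffer it came from; the question in the thread is only which read path the JIT compiles more efficiently. A minimal JDK-only illustration (class name and values are mine):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.LongBuffer;

public class LongBufferView {
  public static void main(String[] args) {
    ByteBuffer bb = ByteBuffer.allocate(32).order(ByteOrder.LITTLE_ENDIAN);
    for (int i = 0; i < 32; i++) { bb.put(i, (byte) i); }

    // The view shares the underlying bytes and inherits the byte order.
    LongBuffer lb = bb.asLongBuffer();

    long viaByteBuffer = bb.getLong(8);
    long viaLongBuffer = lb.get(1); // LongBuffer index 1 = byte offset 8

    System.out.println(viaByteBuffer == viaLongBuffer); // prints true
  }
}
```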

  • Also, I'm a bit surprised that this switch statement would be faster than just a tight loop. But perhaps you have recently characterized both and I could be wrong.

I did try both. I wrote a JMH benchmark, ran it on an AWS m5.large, and got the following results:

Benchmark    (style)  (numBytes)  Mode  Cnt   Score   Error  Units
hash_buffer     loop            4  avgt   15  18.924 ± 0.031  ns/op
hash_buffer     loop           12  avgt   15  20.665 ± 0.044  ns/op
hash_buffer     loop           20  avgt   15  39.914 ± 0.074  ns/op
hash_buffer     loop           40  avgt   15  37.688 ± 0.104  ns/op
hash_buffer     loop           64  avgt   15  41.518 ± 0.097  ns/op
hash_buffer   switch            4  avgt   15  14.681 ± 0.022  ns/op
hash_buffer   switch           12  avgt   15  16.652 ± 0.028  ns/op
hash_buffer   switch           20  avgt   15  35.020 ± 0.030  ns/op
hash_buffer   switch           40  avgt   15  35.596 ± 0.052  ns/op
hash_buffer   switch           64  avgt   15  39.433 ± 0.175  ns/op

"switch" is what is in this patch. "loop" is something that is basically the same as the current datasketches-java implementation for byte arrays (a small tight loop).

So consider this approach:

  • Use Memory to wrap the ByteBuffer. Memory uses Unsafe to access the
    bytes underneath the BB.
  • Hash the BB bytes using XxHash. This is already a built-in function
    and is very fast.
  • Take the resulting 64-bit hash and feed it as a long into HLL. Yes,
    you will be hashing twice, but if the size of your BB is more than 10 or so
    longs, the speed you gain from the xxHash will make up for the double
    hashing.

Interesting idea — we'll keep that in mind. I suspect that in most cases, though, the string size will be less than 80 bytes.

@leerho
Member

leerho commented May 11, 2021 via email

@gianm
Contributor Author

gianm commented May 11, 2021

Consider this:

  • Wrap the entire column into Memory.
  • The xxHash method takes an offset and length ...
  • loop on the long memory.xxHash64(offsetBytes, lengthBytes, seed) rather than creating slice objects. This returns the xxHash immediately, which you can insert into the sketch, i.e., sketch.update(memory.xxHash64(offset, length, seed));

Yeah, that makes sense. We'd like to do something like this eventually, for the reasons you describe, but it's a major effort and it will take some time to find the bandwidth.

@leerho
Member

leerho commented May 11, 2021 via email
