Add ByteBuffer hashing methods to MurmurHash3, BaseHllSketch. #353
leerho merged 1 commit into apache:Gian-MurmurHash3 from
Conversation
Hmm, I'm not sure what the test coverage error means. It doesn't sound like it is related to this patch.
I will take a look :)
Gian
@leerho Could you please create the branch first? The GitHub UI only lets me select pre-existing branches, and I don't have permissions to create a new branch.
Sorry, I was about to do that earlier, but I had contractors in the house and I got pulled away. It should be there now. Lee.
I didn't realize I could do it myself. Thanks anyway.
@gianm
I'm not sure you saw this as I posted it on the PR after it was closed. My
bad.
Now that I've had a look at this, I have some questions and possible
suggestions to make it even faster.
1. It would be helpful to me if you could give me an idea of the range
of sizes of these ByteBuffer objects that you wish to create a unique hash
of. Even better would be a description of the distribution of sizes (e.g,
95%ile is 100 bytes, median is 10 bytes, etc.)
2. As I recall from some characterization studies I did several years
ago, a ByteBuffer.getLong() is much slower than a
LongBuffer.getLong() because
under the covers, for the ByteBuffer.getLong() call, the BB does a
brute-force byte conversion of the long, rather than calling
Unsafe.getLong(). And if your BB is of any reasonable size, the getLong
call will predominate. I understand why you chose this, but just keep in
mind what Java is doing under the covers, and I think we can do better.
3. Also, I'm a bit surprised that this switch statement would be faster
than just a tight loop. But perhaps you have recently characterized both
and I could be wrong.
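As a concrete illustration of the loop pattern discussed in points 2 and 3, here is a stdlib-only sketch (this is not the datasketches code; the class name BufferWalk and the toy mixing step are made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// A minimal sketch of the "tight loop" pattern under discussion: consume the
// buffer eight bytes at a time via ByteBuffer.getLong(), then fold in the
// remaining tail bytes. A real hash (MurmurHash3, xxHash) mixes far better
// than the toy (acc * 31) ^ value step used here.
public final class BufferWalk {
    static long fold(ByteBuffer buf) {
        // duplicate() so the caller's position/limit are untouched
        ByteBuffer bb = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        long acc = 0;
        while (bb.remaining() >= Long.BYTES) {
            acc = (acc * 31) ^ bb.getLong();  // the per-long call Lee is discussing
        }
        long tail = 0;
        for (int shift = 0; bb.hasRemaining(); shift += 8) {
            tail |= (bb.get() & 0xFFL) << shift;  // little-endian tail assembly
        }
        return (acc * 31) ^ tail;
    }

    public static void main(String[] args) {
        ByteBuffer a = ByteBuffer.wrap("sixteen-byte-str".getBytes());
        ByteBuffer b = ByteBuffer.wrap("sixteen-byte-st_".getBytes());
        System.out.println(fold(a) != fold(b));  // distinct inputs give distinct folds here
    }
}
```

The per-call overhead Lee mentions applies to each bb.getLong() above; the loop-vs-switch question is only about how those calls are arranged.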
WRT the MurmurHash3 modifications.
You should know that the MurmurHash3 is a great hash function, and we chose
it a long time ago and are stuck with it for binary compatibility reasons.
Nonetheless, there is a newer hash function, the XxHash that has comparable
avalanche and bit-independence to the MurmurHash3 and is about twice as
fast. We already have the XxHash integrated into the datasketches.hash package
as well as in the Memory component. And, in fact, it was integrated into
the Memory component for just this kind of case.
So consider this approach:
- Use Memory to wrap the ByteBuffer. Memory uses Unsafe to access the
bytes underneath the BB.
- Hash the BB bytes using XxHash. This is already a built-in function
and is very fast.
- Take the resulting 64-bit hash and feed it as a long into HLL. Yes,
you will be hashing twice, but if the size of your BB is more than 10 or so
longs, the speed you gain from the xxHash will make up for the double
hashing.
How much improvement you get will depend on your distribution of BB sizes,
which comes back to my question in #1
<#1>.
Keep in mind that Panama is now integrated into JDK14+, (JDK16 is the most
recent) and I am in process of migrating the whole Java library, first to
JDK9-13, and then highly leveraging Panama starting with JDK17, which is
the next LTS after JDK11.
With Panama the scheme that you would use would be identical except you
would use the new Panama MemorySegment instead of our Memory, but the steps
would be the same. So the switch over to Panama should be really easy.
Lee.
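The three-step approach above can be sketched with stdlib-only stand-ins (fastHash64 below is a placeholder FNV-1a where the real proposal uses Memory.xxHash64 from datasketches-memory, and a HashSet stands in for the HLL sketch; all names here are illustrative):

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Set;

// Sketch of the "hash once, then update the sketch with a long" flow:
// hash the ByteBuffer's bytes to a single 64-bit value with a fast hash,
// then feed that long into the sketch's existing update(long) path.
public final class DoubleHashFlow {
    // Placeholder 64-bit hash (FNV-1a) standing in for xxHash64; uses
    // absolute get(i) so the buffer's position is never disturbed.
    static long fastHash64(ByteBuffer buf) {
        long h = 0xcbf29ce484222325L;
        for (int i = buf.position(); i < buf.limit(); i++) {
            h = (h ^ (buf.get(i) & 0xFF)) * 0x100000001b3L;
        }
        return h;
    }

    public static void main(String[] args) {
        Set<Long> sketchStandIn = new HashSet<>();   // pretend HLL sketch
        for (String s : new String[] {"alice", "bob", "alice"}) {
            ByteBuffer bb = ByteBuffer.wrap(s.getBytes());
            sketchStandIn.add(fastHash64(bb));       // i.e., sketch.update(hash)
        }
        System.out.println(sketchStandIn.size());    // 2 distinct identifiers
    }
}
```

This is the "double hashing" trade: one extra 64-bit hash per row, in exchange for a faster bulk pass over the bytes.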
…On Fri, May 7, 2021 at 11:51 AM Gian Merlino ***@***.***> wrote:
Gian
Please modify this PR to merge into "Gian-MurmurHash3" instead of "master".
Thanks,
Lee.
@leerho — some thoughts on your questions:
The use case I had in mind is query-time approximate count distinct on string columns. So the distribution of sizes would vary based on the column. I would guess that most of the time, people want distinct counts of things like usernames or user ids (8–50 chars) or short indicator variables (1–8 chars). The case that inspired us to work on this patch involved 16-char strings.
Are you suggesting that we convert the ByteBuffer to a LongBuffer first, and then do a getLong on that?
I did try both; I wrote a JMH benchmark and ran it on an AWS m5.large and got the following results ("switch" is what is in this patch; "loop" is basically the same as the current datasketches-java implementation for byte arrays, a small tight loop):
hash_buffer loop 4 avgt 15 18.924 ± 0.031 ns/op
hash_buffer loop 12 avgt 15 20.665 ± 0.044 ns/op
hash_buffer loop 20 avgt 15 39.914 ± 0.074 ns/op
hash_buffer loop 40 avgt 15 37.688 ± 0.104 ns/op
hash_buffer loop 64 avgt 15 41.518 ± 0.097 ns/op
hash_buffer switch 4 avgt 15 14.681 ± 0.022 ns/op
hash_buffer switch 12 avgt 15 16.652 ± 0.028 ns/op
hash_buffer switch 20 avgt 15 35.020 ± 0.030 ns/op
hash_buffer switch 40 avgt 15 35.596 ± 0.052 ns/op
hash_buffer switch 64 avgt 15 39.433 ± 0.175 ns/op
Interesting idea — we'll keep that in mind. I suspect that in most cases, though, the string size will be less than 80 bytes.
@gianm
Are you suggesting that we convert the ByteBuffer to a LongBuffer first,
and then do a getLong on that?
No. That would only work if your buffer sizes were always a multiple of 8
bytes, and I doubt you can guarantee that. I was telling you this to let
you know that ByteBuffer.getLong() is quite slow on its own. Whether you
wrap it inside a loop or a switch doesn't fix the problem that the call
itself is slow.
I didn't realize your BBs were so small. Hmm. You must be doing lots of
ByteBuffer.slice() calls to get those. And BB.slice() is not known to be
very fast either, although I haven't actually characterized it.
Nonetheless, all those slices are creating objects that ultimately the GC
has to collect ... not good.
(I am speculating about what your situation is ... danger!)
I presume your "column" is a much larger BB sector that may have thousands
of these identifiers in it, prior to your creating slices based on offsets.
Consider this:
- Wrap the entire column into Memory.
- The xxHash method takes an offset and length ...
- Loop on long memory.xxHash64(offsetBytes, lengthBytes, seed) rather than creating slice objects.
This returns the xxHash immediately, which you can insert into the sketch, i.e., sketch.update(memory.xxHash64(offset, length, seed)).
This bypasses the creation of the BB.slice() objects, the slow BB.getLong() method, and the need for your fancy switch.
Lee.
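The slice-free pattern described above can be sketched with stdlib-only code (rangeHash64 below is a placeholder FNV-1a standing in for memory.xxHash64(offsetBytes, lengthBytes, seed); the class name and the packed column layout are made up for illustration):

```java
import java.nio.ByteBuffer;

// Keep one buffer for the whole "column" and hash each identifier by its
// absolute (offset, length), never materializing a BB.slice() per row.
public final class SliceFree {
    // Hash an absolute range; only absolute get(i) is used, so the buffer's
    // position/limit are never touched and no garbage is created per row.
    static long rangeHash64(ByteBuffer column, int offset, int length) {
        long h = 0xcbf29ce484222325L;
        for (int i = offset; i < offset + length; i++) {
            h = (h ^ (column.get(i) & 0xFF)) * 0x100000001b3L;
        }
        return h;
    }

    public static void main(String[] args) {
        // A toy column holding four packed identifiers: user, user1234, guest, user
        ByteBuffer column = ByteBuffer.wrap("useruser1234guestuser".getBytes());
        int[][] rows = { {0, 4}, {4, 8}, {12, 5}, {17, 4} };
        for (int[] r : rows) {
            long h = rangeHash64(column, r[0], r[1]);  // sketch.update(h) in real code
            System.out.println(h);
        }
    }
}
```

Equal byte ranges hash equally regardless of where they sit in the column, which is exactly what the per-row distinct count needs, and no per-row objects are allocated.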
…On Mon, May 10, 2021 at 10:34 AM Gian Merlino ***@***.***> wrote:
@leerho <https://github.com/leerho> — some thoughts on your questions:
- It would be helpful to me if you could give me an idea of the range
of sizes of these ByteBuffer objects that you wish to create a unique hash
of. Even better would be a description of the distribution of sizes (e.g,
95%ile is 100 bytes, median is 10 bytes, etc.)
The use case I had in mind is query-time approximate count distinct on
string columns. So the distribution of sizes would vary based on the
column. I would guess that most of the time, people want distinct counts of
things like usernames or user ids (8–50 chars) or short indicator variables
(1–8 chars). The case that inspired us to work on this patch involved
16-char strings.
- As I recall from some characterization studies I did several years
ago, a ByteBuffer.getLong() is much slower than, a LongBuffer.getLong()
because under the covers, for the ByteBuffer.getLong() call, the BB
does a brute-force byte conversion of the long, rather than calling
Unsafe.getLong(). And if your BB is of any reasonable size, the
getLong call will predominate. I understand why you chose this, but just
keep in mind what Java is doing under the covers, and I think we can do
better.
Are you suggesting that we convert the ByteBuffer to a LongBuffer first,
and then do a getLong on that?
- Also, I'm a bit surprised that this switch statement would be faster
than just a tight loop. But perhaps you have recently characterized both
and I could be wrong.
I did try both; I wrote a jmh benchmark and ran it on an AWS m5.large and
got the following results:
hash_buffer loop 4 avgt 15 18.924 ± 0.031 ns/op
hash_buffer loop 12 avgt 15 20.665 ± 0.044 ns/op
hash_buffer loop 20 avgt 15 39.914 ± 0.074 ns/op
hash_buffer loop 40 avgt 15 37.688 ± 0.104 ns/op
hash_buffer loop 64 avgt 15 41.518 ± 0.097 ns/op
hash_buffer switch 4 avgt 15 14.681 ± 0.022 ns/op
hash_buffer switch 12 avgt 15 16.652 ± 0.028 ns/op
hash_buffer switch 20 avgt 15 35.020 ± 0.030 ns/op
hash_buffer switch 40 avgt 15 35.596 ± 0.052 ns/op
hash_buffer switch 64 avgt 15 39.433 ± 0.175 ns/op
"switch" is what is in this patch. "loop" is something that is basically
the same as the current datasketches-java implementation for byte arrays (a
small tight loop).
So consider this approach:
- Use Memory to wrap the ByteBuffer. Memory uses Unsafe to access the
bytes underneath the BB.
- Hash the BB bytes using XxHash. This is already a built-in function
and is very fast.
- Take the resulting 64-bit hash and feed it as a long into HLL. Yes,
you will be hashing twice, but if the size of your BB is more than 10
or so
longs, the speed you gain from the xxHash will make up for the double
hashing.
Interesting idea — we'll keep that in mind. I suspect that in most cases,
though, the string size will be fewer than 80 bytes.
Yeah that makes sense. We'd like to do something like this eventually, for the reasons you describe, but it's a major effort and will take some time to find the bandwidth to do.
@gianm
By the way, I hope you realize that when you wrap your large BB segment
into Memory, it is just a view of the contents of the BB. No copying. And
everything that you are doing with that BB segment you can keep on
doing, treating it still as a BB. So you will have two views of the
contents of your BB segment: one from the BB and one from Memory. I was
hoping that you could create this Memory view just for this task you are
trying to speed up :)
Cheers,
Lee.
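The "two views, no copying" point can be seen with the stdlib alone: a second view over the same storage sees writes made through the first, which is the same sharing behavior a Memory wrap of a ByteBuffer gives you (Memory itself is not used in this illustration):

```java
import java.nio.ByteBuffer;

// Two views over one backing array: duplicate() shares the bytes but has
// independent position/limit, analogous to wrapping the BB in Memory.
public final class TwoViews {
    public static void main(String[] args) {
        ByteBuffer original = ByteBuffer.allocate(8);
        ByteBuffer view = original.duplicate();  // no copy of the contents
        original.put(0, (byte) 42);              // write through one view...
        System.out.println(view.get(0));         // ...is visible through the other: 42
    }
}
```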
My motivation for introducing the ByteBuffer methods is to allow Druid to pass mapped buffers directly to an HllSketch; see apache/druid#11172, apache/druid#11201. We're trying to eliminate unnecessary decodes and copies from the string-column-to-hll-sketch execution path.
Benchmarks in the second Druid PR, without this change, are clocking about 50 ns per row for very short strings (the benchmark mostly uses 2–4 character strings). We can speed up the per-row time on this benchmark by another 10 ns or so by applying this patch to datasketches-java, which lets us eliminate one more copy.