
Densify swapped hll buffer #6865

Merged
clintropolis merged 5 commits into apache:master from drcrallen:hll/fixbufferoverflow on Mar 6, 2019

Conversation

@drcrallen (Contributor) commented Jan 15, 2019

The legacy Druid HLL module assumes the buffers are dense when doing a fold operation. This is not always the case and can lead to a BufferOverflowException. This PR checks whether any densification needs to happen.

We had an upstream data producer who was sampling data. The sampling algorithm seemed to be based on Murmur3_128, or at least a related algorithm with similar hash collisions. When doing an HLL sketch of the dimension values, we were getting really weird results where all the HLL buckets would end up with values that were not good sketches of the input data (every bucket nibble set to 1, for example). testCanFillUpOnMod demonstrates such a scenario.

The unfortunate side effect of this was that the folding operation can easily corrupt buffers if the buffer being folded in is sparse. testRegisterSwapWithSparse will fail against master at folded.toByteBuffer(), similarly to how the Jackson serialization of the collector fails on historicals in the error mode we found.

java.nio.BufferOverflowException
	at java.nio.Buffer.nextPutIndex(Buffer.java:527)
	at java.nio.HeapByteBuffer.putShort(HeapByteBuffer.java:321)
	at org.apache.druid.hll.HyperLogLogCollector.toByteBuffer(HyperLogLogCollector.java:488)

With this PR applied, the query does not crash, but does return a sketch that is useless, as demonstrated by the estimate-cardinality checks in the added unit tests.

A tangential long term solution here would probably be to also seed the murmur hash with a custom value... but that will break historical compatibility in nasty ways.

drcrallen added the Bug label on Jan 15, 2019
@gianm (Contributor) commented Jan 16, 2019

A tangential long term solution here would probably be to also seed the murmur hash with a custom value... but that will break historical compatibility in nasty ways.

On the topic of breaking historical compatibility (and an alternative HLL algo that uses a custom seed), check out #6814.

jon-wei added this to the 0.14.0 milestone on Feb 1, 2019
@gianm (Contributor) commented Feb 5, 2019

Is it better to return a useless sketch than to throw an error? I would think the error is better if we know the results are going to be garbage. Maybe improving the error message is what's called for.

@leerho (Member) commented Feb 6, 2019

I agree with @gianm.

We had an upstream data producer who was sampling data. The sampling algorithm seemed to be based on Murmur3_128, or at least a related algorithm with similar hash collisions. When doing an HLL sketch of the dimension values, we were getting really weird results where all the HLL buckets would end up with values that were not good sketches of the input data (every bucket nibble set to 1, for example).

Yikes!

I'm not sure I understand, but I hope that you are not sampling data prior to feeding it to a sketch. This will produce potentially horrible errors no matter what sketch you use, and it doesn't matter what hash function was used in the sampling either. Sketches are streaming algorithms and rely on being fed every item of the stream.

Nonetheless, these weird results with 1's in every nibble are a catastrophic failure of the sketch, I don't care what values were fed to it. There must be something very unusual about your use of the sketch. Some more detail about how you are using and feeding the sketch would be helpful.

@drcrallen (Contributor, author) commented Feb 7, 2019

@leerho it is doing a version of sampling (but NOT event sampling) prior to sending data to the sketch. Specifically, the sketch is against ALL events in a specific sub-set of the data.

Basically: pick some quantity of IDs. Assume that the IDs selected are a representative sample of the total population. Log all events from the selected IDs. Then sketches against the IDs should be fine for that sub-set, with the knowledge that you can ONLY account for things happening in the sample population (e.g. no, or very limited, network-effect analysis).

This tends to work pretty well for quick insights on big effects. The problem comes in when someone uses a simple hash(id) % some_number (or something that becomes effectively that) to determine if the ID should be part of the sample set AND an HLL sketch uses the same hash function with the same seed against the id. An example of this is included in the unit tests.
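
To make that failure mode concrete, here is a minimal hypothetical sketch (not the Druid code; the class name and the odd/even sampling rule are made up for illustration) of what happens when the upstream sampler and the sketch share the same unseeded Murmur3_128:

    import com.google.common.hash.HashFunction;
    import com.google.common.hash.Hashing;

    public class CorrelatedSamplingDemo {
      public static void main(String[] args) {
        // Same hash function and (zero) seed that the Druid HLL aggregator uses.
        final HashFunction fn = Hashing.murmur3_128();

        long kept = 0;
        for (long id = 0; id < 1_000_000L; id++) {
          final long h = fn.hashLong(id).asLong();
          // Hypothetical upstream sampler: keep an id only when its hash is odd.
          if ((h & 1L) == 0L) {
            continue;
          }
          kept++;
          // The sketch later re-hashes the same id with the same function and seed,
          // so every hash it ever sees has its low bit set: the bit statistics the
          // HLL register values are derived from are no longer uniform, and the
          // estimate degrades no matter how many distinct ids are fed in.
        }
        System.out.println("kept " + kept + " of 1,000,000 ids; every kept hash is odd");
      }
    }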

@drcrallen (Contributor, author):

@gianm I'm not convinced the results are always "good" in the absence of the error. Specifically, there are a number of "bad" ways to send in data that work contrary to how an HLL sketch based on Murmur3_128(seed=0) is intended to work. And not all of them will induce the kind of error presented here.

@leerho (Member) commented Feb 8, 2019

@drcrallen

I have several comments on your situation:

This isn't true in practice. The druid-hll library lets callers use any hash function, but Druid doesn't expose that to end users; it always uses Hashing.murmur3_128().

  • The sketch must do its own hashing, preferably with its own hash function and a private seed, and users should not peek inside and use the same hash function with the same seed to perform upstream modulo sampling, as you do in testCanFillUpOnMod().

  • HLL sketches are stochastic functions that rely on good randomness properties of the hash function that are independent of the incoming data! So by using the exact same hash function and the same seed in your mod function you are violating this independence property and all bets are off!

  • Nonetheless, what you also have uncovered is likely a bug. I took your test and added the DataSketches HLL sketch in parallel inside the test. I also added a few more outputs at the end and got these results:
Filled up registers after 3,918,870 random numbers
Count: 19590
HLLc Uniques: 0
DS-HLL Uniques: 19169.299781

The Count is the number of times a value is added to the sketch (at the bottom of the do loop). This is not the true number of uniques as there may be a few collisions amongst those 4M random numbers, but it was adequate for this experiment.

The Druid HLL shows a count of zero. I did not debug this but perhaps by checking the NumNonZeroRegisters variable just when it hits zero, you are catching the sketch just before it transitions. I am not sure, I am kinda surprised by this.

The DS-HLL shows a count of 19169 which is well within the error bounds of the sketch of that size.


As for your suggested change, I'm really not sure what ultimate effect it will have.

Cheers

@gianm (Contributor) commented Feb 8, 2019

@drcrallen I’m trying to understand what the bug was. Is it that the old folding code assumed that a sketch with an overflow register set would always be dense? And your change is patching it to densify the buffer if it turned out to be sparse?

If so then the general idea looks good to me. It sounds like it’s fixing an implementation bug that isn’t likely to get triggered by a properly injected flow of data, but is still an implementation bug nonetheless.

It also sounds like the zero estimate @leerho noted might be pointing to a different bug of some sort, not related to the one this patch is fixing? You wouldn’t expect to get a zero estimate when there are nonzero registers, right?

Separately, a side note. Discussion of flaws in the design and implementation of existing Druid features is very useful and appreciated. But when choosing words, I’d ask to please be civil and considerate of the fact that in many cases, the original designers and implementers are still around and part of the community.

@leerho (Member) commented Feb 8, 2019

@gianm you are absolutely right and I apologize.
It was late in the day and I get frustrated at having to keep returning to debugging this code. As penance I will dig in and figure out what is going on. :)

@leerho (Member) commented Feb 9, 2019

@gianm

Looking for guidance here...

I have spent the past few days studying the Druid-HLL code and have uncovered at least a half-dozen serious bugs and haven't even started on the merge logic, which from a brief look also has very serious problems. A number of these problems are interconnected, so you can't just fix one at a time.

This code needs to be redesigned from scratch. I'm not sure I want to undertake this, but if I did, I would insist on some major changes. The API would have some required changes, the biggest being removal of the ability for users to specify the hash function. Any users currently doing that with a different hash function, and who have historical stored images, may not be able to use their history.

I am considering 2 strategies:

  1. Rewrite existing code: Many current internal methods that are now public will become private or package-private: e.g., no public access to internals such as the overflow registers, getNumNonZeroRegisters, getHeaderBytes, getPayloadBytePosition, setVersion, etc. I may also insist on a merge class rather than a merge method ( fold() ).

The storage would be a little larger (from 1031 bytes to perhaps 1080 bytes). And merge performance may be a bit slower. It could still be backward compatible, but old images will still propagate errors into the new design and there is nothing that can be done about that. Users that record their history with the new design will see much better error performance.

  2. An API wrapper to the DS-HLL sketch. This will require some changes to the DS-HLL code, but may be easier to do and would eliminate a lot of code duplication. Basically I have to allow for the non-seeded hash function, adapt to the big-endian fields used by the Druid-HLL, create a merge function that can read the old Druid-HLL images, etc.

The advantage of this is (hopefully) a single code-base for the sketch internals with two different wrappers, one for the old Druid-HLL users, and a new full-function API wrapper for the DS-HLL customers.

Either way, the new design would have to be extensively characterized and tested.

Thoughts?

@drcrallen (Contributor, author):

@gianm You are correct: the prior impl made some assumptions about having a dense buffer, but we had some cases where it came in sparse. So this PR checks for such a case and densifies as needed.

It also adds some test cases for some "easy to accidentally hit" scenarios which are in the realm of what @leerho is talking about as far as challenges with the default Druid HLL implementation.

@leerho (Member) commented Feb 11, 2019

@drcrallen , @gianm

I have a number of concerns about this PR that I am still investigating. Please don't merge this yet.

@drcrallen (Contributor, author):

@leerho IMHO, if DataSketches is intended to be the enterprise sketch support library, then the effort put into fixing the Druid HLL sketch library should be minimal.

I know we have an internal effort to fix some of the accuracy issues in the existing HLL cardinality estimations. Beyond some high-level fixes, if I had to choose between my team investing in HLL fixes at the data-format level, or finding a way to validate DataSketches for enterprise sketching needs, I'd rather spend the time on DataSketches validation and adoption, since it has applications and impact beyond just Druid.

@drcrallen (Contributor, author):

Also, please keep in mind this PR is not trying to make things perfect, just "better".

@drcrallen (Contributor, author):

I'm also worried #6865 (comment) won't reach a big enough audience here. @leerho if you don't have the info you need from this thread, posting such information and insights to the dev list would likely reach a more diverse audience.

@clintropolis (Member) left a comment

bugfix LGTM 👍

I suggest the larger discussion about replacing hll with datasketches hll be moved to an issue or the dev list (or both)

@clintropolis (Member):

#6814 might be the most appropriate place to continue discussion?

* This is a very long running test, disabled by default.
* It is meant to catch issues when combining a large numer of HLL objects.
*
* <p>

@clintropolis (Member): nit: accidental addition?

@drcrallen (Contributor, author): fixed

@clintropolis (Member):

@leerho are your concerns specific to the fix this PR is doing or the Druid hll implementation in general?

This seems like it's fixing an oversight in the original implementation, since the collector was already converting itself to the dense representation if it wasn't dense, but wasn't checking whether the other hll collector needed the same thing during the fold.

@leerho (Member) commented Feb 17, 2019

@clintropolis

Sorry about the delay, I was pulled off onto other problems :)

This code has so many serious issues that I hardly know where to start. If we are going to submit fixes, then I would suggest that we fix an issue throughout the code and not just in one place.

  • Checking for sparse vs dense
    Specifically, your added check for sparseness in the fold() routine may be OK on the surface, but one of the serious problems in the code is that the author uses a very fragile method for determining whether the sketch is sparse or dense, and it is done differently in different parts of the code. (It should have been a simple boolean in the header!)

In your check you compare (and in other parts the author does this as well ...)
if (other.storageBuffer.remaining() != other.getNumBytesForDenseStorage())

The isSparse(buffer) method compares: buffer.remaining() != NUM_BYTES_FOR_BUCKETS;

They are different! NUM_BYTES_FOR_BUCKETS = 1024, while getNumBytesForDenseStorage() = 1031.

The author should have used isSparse(buffer) everywhere.

It is fragile because the remaining size of the buffer depends on the current state of the buffer's position and limit. Yikes! (See the small standalone sketch below.)

And isSparse(buffer) sometimes fails as follows:

One of the tests you added demonstrates this. In testRegisterSwapWithSparse(), after you artificially fill all the registers with 1's, the cardinality check returns zero and you say it is "fine".
It is not "fine". It is a symptom of more serious issues.

The reason it returns zero is that toByteBuffer() returns a sparse representation when it should be dense, and estimateCardinality() on a sparse sketch uses the linear (or Poisson) estimator, k log (k/v); with v = k the estimate is zero! The sketch should not have been sparse, and even if it were sparse, the linear estimator should never be used when v approaches k!

These issues could be fixed, but it will require a version change and a lot of careful coding.
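
As a concrete illustration of how a remaining()-based density check can flip, here is a small hypothetical standalone sketch; the 1024/1031 constants mirror the sizes quoted above, and the 7-byte header size is simply their difference:

    import java.nio.ByteBuffer;

    public class RemainingCheckDemo {
      private static final int NUM_BYTES_FOR_BUCKETS = 1024;       // payload only
      private static final int NUM_BYTES_FOR_DENSE_STORAGE = 1031; // header (7 bytes) + payload

      public static void main(String[] args) {
        final ByteBuffer dense = ByteBuffer.allocate(NUM_BYTES_FOR_DENSE_STORAGE);

        // With the position at 0, the two checks disagree about the very same bytes.
        System.out.println(dense.remaining() != NUM_BYTES_FOR_DENSE_STORAGE); // false -> "dense"
        System.out.println(dense.remaining() != NUM_BYTES_FOR_BUCKETS);       // true  -> "sparse"

        // Advance the position past the header and both answers flip, even though
        // the sketch content has not changed at all.
        dense.position(7);
        System.out.println(dense.remaining() != NUM_BYTES_FOR_DENSE_STORAGE); // true  -> "sparse"
        System.out.println(dense.remaining() != NUM_BYTES_FOR_BUCKETS);       // false -> "dense"
      }
    }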

I have more, but I have to go now.

@leerho (Member) commented Feb 18, 2019

@drcrallen @clintropolis
...continued:

  • I'm not sure if you noticed, but line 388 in HyperLogLogCollector, just above your inserted lines, is a no-op:

storageBuffer.duplicate().put(other.storageBuffer.asReadOnlyBuffer());

The returned duplicated read-only buffer is never captured or used.

  • Inside the scope where you put your sparse check:
    if (getRegisterOffset() < other.getRegisterOffset()) {
      // "Swap" the buffers so that we are folding into the one with the higher offset
      final ByteBuffer tmpBuffer = ByteBuffer.allocate(storageBuffer.remaining());
      tmpBuffer.put(storageBuffer.asReadOnlyBuffer());
      tmpBuffer.clear();

      storageBuffer.duplicate().put(other.storageBuffer.asReadOnlyBuffer());

      //Added by PR #6865
      if (other.storageBuffer.remaining() != other.getNumBytesForDenseStorage()) {
        // The other buffer was sparse, densify it
        final int newLImit = storageBuffer.position() + other.storageBuffer.remaining();
        storageBuffer.limit(newLImit);
        convertToDenseStorage();
      } //end of PR #6865

      other = HyperLogLogCollector.makeCollector(tmpBuffer);
    }

The other sketch can never be sparse! ... Once the sketch has increased the register offset above zero, the sketch should never be placed back into the sparse state, ever! ... Unless you fill the registers artificially with impossible values as you do in your test!

The HLL sketch is a complex state machine, and if implemented correctly, should never allow itself to be placed in illegal states, especially from public methods. It is hard enough to validate that the state machine operates correctly, but placing it in illegal states and then expecting meaningful results is asking for trouble. You are not the only one to fall into this trap; clearly the author did too.

  • The swap logic itself is rather risky. Because of the existence of the overflow register, there are a number of corner cases that complicate the merging and it is hard to confirm that they are all handled correctly. The test code doesn't seem to check for these corner cases either. (There are much safer ways to do this merging, but that is another topic.)

  • WRT your test called testCanFillUpOnMod(): I sense that what you are trying to do is use this as an example of what NOT to do. However, it should be annotated with @Ignore so that it doesn't always log the message "Filled up registers after %s random numbers".

Nonetheless, the real issue to warn users about is to never use Hashing.murmur3_128() without specifying a non-zero seed while also using that same hash function to effectively select which hash values are presented to the sketch. All that your test essentially does is select only odd hash values to be presented to the sketch, thus all the registers of the sketch will end up with 1's. The likelihood that this could ever occur using the sketch properly is probably longer than the age of the universe.

This goes back to what I have said before, that the hashing operation should never have been exposed and delegated to the user. But that is history.

In conclusion

  1. The additional code added to HyperLogLogCollector is unnecessary. It would only ever be executed from tests that create illegal states of the sketch. What should be modified is the tests!

  2. Both of the tests added to HyperLogLogCollectorTest use methods that place the sketch in illegal states. The first test could be rewritten to easily confirm proper swapping logic without doing this. The second test IMHO, is too complex and misses the bigger danger of misusing the hash function.

@clintropolis (Member):

@leerho Thanks for looking so deeply into this and for the detailed explanation! I am not very familiar with this area of the code and am definitely guilty of reviewing this change in a superficial manner, without verifying whether the other collector being in a sparse state was legitimate or not, or having a deeper understanding of the design that went into creating this hll collector in the first place. Maybe this should have been my cue to not get involved in this one 😜

The additional code added to HyperLogLogCollector is unnecessary. It would only ever be executed from tests that create illegal states of the sketch.

If I understand your analysis correctly, rather than changing nothing: if the other buffer should never be in the sparse state, then this check should throw some sort of illegal state exception instead of allowing it to explode with a BufferOverflowException. Explicitly checking for this illegal condition (and linking to this discussion thread in code comments) would, I think, be beneficial, since the BufferOverflowException makes it appear as if it is an implementation bug rather than a design flaw in the hll collector.

Other thoughts:

storageBuffer.duplicate().put(other.storageBuffer.asReadOnlyBuffer());

The returned duplicated read-only buffer is never captured or used.

I don't think this is actually a no-op: ByteBuffer#duplicate makes a new object that points to the same storage as the original buffer. Changes to the original show up in the duplicate, and vice versa; the duplicated buffer just has independent position, limit, etc. So this line is copying the contents of other into storageBuffer without modifying the positions or limits of either, meaning the added check from this PR that follows this line adjusts the limit of the now-modified storageBuffer and then densifies it if it is not 'dense' sized.
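
A minimal standalone example of that duplicate() behavior (toy buffers, not the collector's):

    import java.nio.ByteBuffer;

    public class DuplicatePutDemo {
      public static void main(String[] args) {
        final ByteBuffer storageBuffer = ByteBuffer.allocate(8);
        final ByteBuffer other = ByteBuffer.wrap(new byte[]{1, 2, 3, 4, 5, 6, 7, 8});

        // Same pattern as the fold(): write through a duplicate so the original
        // buffer's position and limit stay put while its contents are overwritten.
        storageBuffer.duplicate().put(other.asReadOnlyBuffer());

        System.out.println(storageBuffer.position()); // 0 -- position unchanged
        System.out.println(storageBuffer.get(0));     // 1 -- contents copied from other
      }
    }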

It sounds to me like there are too many design flaws in the hll collector as-is to make it worth fixing in terms of correctness, since fixing it sounds like it would break compatibility anyway, so I think we should probably focus on just trying to get users away from using it. But I do think it is worth addressing implementation-level things like this in some manner: either allowing the flawed design to function as intended, however flawed it may be, or throwing sensible exceptions so users can at least know they have hit the limits of the capabilities of the algorithm, so issues like this don't appear as legitimate bugs in the implementation.

@leerho (Member) commented Feb 20, 2019

Thank you for your thoughtful comments. I looked again at the “noop” line ... and you are right ... I missed it! But it is a convoluted statement that would have benefited from a code comment!

I also agree that throwing an exception would make a lot of sense. That may cause some of the tests that take advantage of the ability to fill the Sketch with illegal sequences to fail. They would have to be rewritten. I’m not sure, you would have to try it and see.

An ISE would be fine, as long as the message indicates that the user of the Sketch has done something seriously wrong by feeding the Sketch an illegal (astronomically unlikely) sequence.

This Sketch implementation is highly vulnerable to accidental corruption because the default hash seed used by the aggregator is zero, and most people, when using hash functions, never bother with a seed. So anybody using Murmur3_128() in other parts of their code may inadvertently create correlation-corruption effects with this Sketch. And there is no warning! Unfortunately, I don't know of any way to detect this or fix it in a compatible way WRT historically stored sketches.
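
One mitigation on the producer side, sketched hypothetically (the class name, seed constant, and method name are made up), is for the upstream sampler to seed its own hash instance so its keep/drop decision is independent of the aggregator's unseeded hash:

    import com.google.common.hash.Hashing;

    public class SeededSamplerSketch {
      // Any fixed non-zero seed private to the sampler; zero is what the Druid
      // aggregator effectively uses, so zero is exactly the value to avoid here.
      private static final int SAMPLER_SEED = 0x5EED1234;

      // Keep roughly one id in every oneInN, independently of the sketch's hashing.
      static boolean keep(long id, int oneInN) {
        final long h = Hashing.murmur3_128(SAMPLER_SEED).hashLong(id).asLong();
        return Math.floorMod(h, (long) oneInN) == 0L;
      }
    }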

@drcrallen (Contributor, author):

The likelihood that this could ever occur using the sketch properly is probably longer than the age of the universe.

That is very true! But never underestimate the odd things that happen in production, where the things you thought were properly done turn out not to be so.

@drcrallen (Contributor, author):

Unless you fill the registers artificially with impossible values as you do in your test!

Or have poor controls in place for handling upstream data sampling.

@drcrallen (Contributor, author):

2. The second test IMHO, is too complex and misses the bigger danger of misusing the hash function.

That is true as well, but I'm not trying to fix the hashing in Druid in this PR, just rectify one specific assumption (always dense) that is not guaranteed to hold for arbitrary data input (it fails when the input is improperly distributed).

@clintropolis (Member):

Hmm, so are we stuck here? I haven't had enough time to spare to dig deep enough into the hll collector to have a strong opinion of my own, so my main concern is that we don't fail with a BufferOverflowException.

Whether that means a friendly error for a case that isn't supposed to happen, or simply handling a case that apparently can happen in practice under certain circumstances, seems to be the point of contention? Code-wise, the solution @drcrallen has doesn't look wrong to me, even if it's a design flaw that this could happen, but it sounds like the solution @leerho suggests is maybe the most correct in terms of... correctness? Anyone else have any opinions here?

@drcrallen (Contributor, author) commented Mar 6, 2019

Even if solutions were presented that fixed up items related to the core HLL aggregator itself, this part of the code would still carry the assumption that the buffer is densified. I do not have full coverage of all test cases where the buffer may or may not be dense at this point in the code, so there may in the future be other cases where it is "normal". The workflow assumes dense, so I added a check to rectify things when that is not the case.

Fixes to the overall HLL algorithm itself are beyond the scope of this PR, and presenting HLL results which "don't make good sense" because of bad input values is not solved by this PR. Meaning, if someone puts in values that do not encounter this error case but still violate some of the (sparsely documented) design constraints of the HLL aggregator, it will STILL present nonsense values even IF this particular code-path fix is not hit. In other words, NOT hitting this error case does NOT mean the HLL approximation is GOOD, just that it didn't crash.

Therefore I argue for keeping the code from crashing, thereby making ALL poorly constructed input values simply return whatever the algorithm can calculate.

@drcrallen (Contributor, author):

For longer term fixes, I'm not even sure if documenting the existing HLL design is worth it. For example, simply saying "Legacy HLL implementation has a lot of rough corner cases that are handled by its successor Data Sketches HLL" or similar.

@drcrallen (Contributor, author):

The problem with the sparseness checks is that sometimes the buffer is a read-only view with the headers, and sometimes it is a read-only view without the headers, and that changes which check is appropriate. Which one you are dealing with is very hard to track in the current design. The check as presented here is the correct check for this specific spot in the code.

@clintropolis (Member):

@drcrallen I spent some time last night trying to sift through the hll code to have a stronger opinion, and I think I agree with you. I suspect anyone that has been using this is maybe already conditioned to expect occasionally wonky results; the flaw here feels more like a design flaw to me at this time, and I do think energy would be better spent getting people onto datasketches hll instead of trying to fix this. I'm going to merge after CI so we can add this to 0.14 👍

@leerho if you strongly disagree with this position, we can open an issue to track further fixes or modifications to this hll algorithm.

@jon-wei (Contributor) commented Mar 6, 2019

For longer term fixes, I'm not even sure if documenting the existing HLL design is worth it. For example, simply saying "Legacy HLL implementation has a lot of rough corner cases that are handled by its successor Data Sketches HLL" or similar.

The 0.14.0 docs will deprecate the old HLL agg and point users to the DataSketches HLL instead: #7195

I'm okay with this PR being merged for 0.14.0; I think "avoid crashes in old HLL, use datasketches HLL for proper results" is a reasonable approach.

clintropolis merged commit 3ed2507 into apache:master on Mar 6, 2019
clintropolis pushed a commit to clintropolis/druid that referenced this pull request Mar 6, 2019
* Densify swapped hll buffer

* Make test loop limit pre-increment

* Reformat

* Fix test comments
drcrallen deleted the hll/fixbufferoverflow branch on March 6, 2019 at 22:57
@leerho (Member) commented Mar 7, 2019

@drcrallen
I agree with what you are attempting to do.

clintropolis added a commit that referenced this pull request Mar 7, 2019
* Densify swapped hll buffer

* Make test loop limit pre-increment

* Reformat

* Fix test comments