Densify swapped hll buffer #6865
Conversation
On the topic of breaking historical compatibility (and an alternative HLL algo that uses a custom seed), check out #6814. |
|
Is it better to return a useless sketch than to throw an error? I would think the error is better if we know the results are going to be garbage. Maybe improving the error message is what's called for. |
|
I agree with @gianm.
Yikes! I'm not sure I understand, but I hope that you are not sampling data prior to feeding it to a sketch. This will produce potentially horrible errors no matter what sketch you use. It also doesn't matter what hash function was used in the sampling either. Sketches are streaming algorithms and rely on being fed every item of the stream. Nonetheless, these weird results with 1's in every nibble are a catastrophic failure of the sketch, I don't care what values were fed to it. There must be something very unusual about your use of the sketch. Some more detail about how you are using and feeding the sketch would be helpful. |
|
@leerho it is doing a version of sampling (but NOT event sampling) prior to sending to the sketch. Specifically, the sketch is against ALL events in a specific sub-set of the data. Basically: pick some qty of IDs, assume that the IDs selected are a representative sample of the total population, and log all events from the IDs selected. Then sketches against the IDs should be fine for that sub-set, with the knowledge that you can ONLY account for things happening in the sample population (ex: no, or very very limited, network-effect analysis). This tends to work pretty well for quick insights on big effects. The problem comes in when someone uses a simple |
|
@gianm I'm not convinced the results are always "good" in the absence of the error. Specifically, there are a number of "bad" ways to send in data that work contrary to how an HLL sketch based on |
|
I have several comments on your situation:
The sketch must do its own hashing, preferably with its own hash function and a private seed, and users should not peek inside and use the same hash function with the same seed to perform an upstream modulo sampling, as you are doing here. HLL sketches are stochastic functions that rely on good randomness properties of the hash function that are independent of the incoming data! So by using the same exact hash function and the same seed in your mod function, you are violating this independence property and all bets are off!
- The Count is the number of times a value is added to the sketch (at the bottom of the do loop). This is not the true number of uniques, as there may be a few collisions amongst those 4M random numbers, but it was adequate for this experiment.
- The Druid HLL shows a count of zero. I did not debug this, but perhaps by checking the NumNonZeroRegisters variable just when it hits zero, you are catching the sketch just before it transitions. I am not sure; I am kinda surprised by this.
- The DS-HLL shows a count of 19169, which is well within the error bounds of a sketch of that size.
- As for your suggested change, I'm really not sure what ultimate effect it will have. Cheers |
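The independence point above can be made concrete with a toy experiment. The following is a hypothetical, self-contained sketch (not Druid or DataSketches code; the mixer hash, seeds, item counts, and register count are all invented for illustration): if the upstream sampler keeps only items whose hash is 0 mod 64, and the sketch derives its register index from the same hash with the same seed, every surviving item lands in a single register, so the register histogram is catastrophically wrong no matter how the ranks are computed.

```java
// Toy demonstration (not Druid code): sampling upstream with the sketch's own
// hash and seed collapses all surviving items into a single HLL register.
import java.util.HashSet;
import java.util.Set;

public class HashCorrelationDemo {
  // splitmix64-style finalizer as a stand-in for the sketch's hash function.
  static long hash(long x, long seed) {
    long z = x + seed + 0x9E3779B97F4A7C15L;
    z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
    z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
    return z ^ (z >>> 31);
  }

  // Counts distinct register indexes (low 6 bits of the sketch hash) touched
  // by items that survive an upstream "keep if hash % 64 == 0" sampler.
  static int registersHit(long samplingSeed, long sketchSeed) {
    Set<Long> buckets = new HashSet<>();
    for (long x = 0; x < 100_000; x++) {
      if (Math.floorMod(hash(x, samplingSeed), 64L) == 0L) {   // upstream sampling
        buckets.add(Math.floorMod(hash(x, sketchSeed), 64L));  // HLL bucket index
      }
    }
    return buckets.size();
  }

  public static void main(String[] args) {
    // Same hash and seed for sampling and sketching: every survivor has the
    // same low bits, so exactly one register is ever touched.
    System.out.println(registersHit(0, 0));   // 1
    // Independent seed for the sketch: survivors spread over nearly all 64
    // registers, as the estimator requires.
    System.out.println(registersHit(0, 42));
  }
}
```

This is the failure class being described: the estimate is garbage not because of any bug in the estimator, but because the sampler destroyed the randomness the sketch depends on.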
|
@drcrallen I’m trying to understand what the bug was. Is it that the old folding code assumed that a sketch with an overflow register set would always be dense? And your change is patching it to densify the buffer if it turned out to be sparse? If so then the general idea looks good to me. It sounds like it’s fixing an implementation bug that isn’t likely to get triggered by a properly ingested flow of data, but is still an implementation bug nonetheless.

It also sounds like the zero estimate @leerho noted might be pointing to a different bug of some sort, not related to the one this patch is fixing? You wouldn’t expect to get a zero estimate when there are nonzero registers, right?

Separately, a side note. Discussion of flaws in the design and implementation of existing Druid features is very useful and appreciated. But when choosing words, I’d ask to please be civil and considerate of the fact that in many cases, the original designers and implementers are still around and part of the community. |
|
@gianm you are absolutely right and I apologize. |
|
Looking for guidance here... I have spent the past few days studying the Druid-HLL code and have uncovered at least a half-dozen serious bugs, and haven't even started on the merge logic, which from a brief look also has very serious problems. A number of these problems are interconnected, so you can't just fix one at a time. This code needs to be redesigned from scratch. I'm not sure I want to undertake this, but if I were, I would insist on some major changes. The API will have some required changes: the biggest one is removing the ability for users to specify the hash function. Any users that are currently doing that with a different hash function and have historical stored images may not be able to use their history. I am considering 2 strategies:

1. The storage would be a little larger (from 1031 bytes to perhaps 1080 bytes), and merge performance may be a bit slower. It could still be backward compatible, but old images will still propagate errors into the new design and there is nothing that can be done about that. Users that record their history with the new design will see much better error performance.
2. The advantage of this one is (hopefully) a single code-base for the sketch internals with two different wrappers: one for the old Druid-HLL users, and a new full-function API wrapper for the DS-HLL customers.

Whatever the choice, the new design would have to be extensively characterized and tested. Thoughts? |
|
@gianm You are correct, the prior impl made some assumptions for a dense buffer, but we had some cases where it came in sparse. So this PR checks for such a case and densifies as needed. It also adds some test cases for some "easy to accidentally hit" scenarios which are in the realm of what @leerho is talking about as far as challenges with the default Druid HLL implementation. |
|
I have a number of concerns about this PR that I am still investigating. Please don't merge this yet. |
|
@leerho IMHO if Data Sketches is intended to be the enterprise sketch support library, then I think the effort put into fixing the druid HLL sketch library should be minimal. I know we have an effort internally to fix some of the accuracy issues in the existing HLL cardinality estimations. Beyond some high-level fixes, if I had to choose between my team investing in HLL fixes at the data format level, or finding a way to validate Data Sketches for enterprise sketching needs, I'd rather have time spent on Data Sketches validation and adoption, since it has applications and impact outside of just Druid. |
|
Also please keep in mind this PR is not trying to make things perfect, just "better" |
|
I'm also worried #6865 (comment) won't reach enough audience here. @leerho if you don't have the info you need from this thread, posting such information and insights to the dev list would likely reach a more diverse audience. |
clintropolis left a comment:
bugfix LGTM 👍
I suggest the larger discussion about replacing hll with datasketches hll be moved to an issue or the dev list (or both)
|
#6814 might be the most appropriate place to continue discussion? |
> * This is a very long running test, disabled by default.
> * It is meant to catch issues when combining a large number of HLL objects.
> *
> * &lt;p&gt;
|
@leerho are your concerns specific to the fix this PR is doing or the Druid hll implementation in general? This seems like it's fixing an oversight in the original implementation, since it was already converting itself to dense representation if it wasn't, but wasn't checking the |
|
Sorry about the delay, I was pulled off onto other problems :) This code has so many serious issues that I hardly know where to start. If we are going to submit fixes, then I would suggest that we fix an issue throughout the code and not just in one place.
In your check you compare two sizes (and in other parts the author does this as well), but they are different! NUM_BYTES_FOR_BUCKETS = 1024, while getNumBytesForDenseStorage() = 1031. The author should have used one of them consistently. The check is also fragile because the remaining size of the buffer relies on the current state of the buffer position and limit. Yikes! One of the tests you added demonstrates this. The reason it returns zero is because toByteBuffer() returns a sparse representation when it should be dense, and estimateCardinality() of a sparse representation uses the linear (or Poisson) estimator, which is not valid in this regime. These issues could be fixed, but it will require a version change and a lot of careful coding. I have more, but I have to go now. |
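The fragility of a size check based on `remaining()` can be sketched in isolation. The following toy is a hypothetical illustration, not the actual Druid code: the class, method shapes, and the reading of the 7-byte difference (1031 − 1024) as header bytes are all assumptions. It shows a capacity check against the 1024-byte payload constant passing while the 1031-byte dense write overflows, precisely because `remaining()` depends on the buffer's current position and limit.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;

// Toy illustration of checking against the wrong size constant.
public class SizeCheckDemo {
  static final int NUM_BYTES_FOR_BUCKETS = 1024;    // register payload only
  static final int NUM_BYTES_DENSE_STORAGE = 1031;  // payload plus header bytes

  // Returns true if the dense write succeeded; false if the (passing!)
  // size check still let a BufferOverflowException happen.
  static boolean writeDense(ByteBuffer buf) {
    if (buf.remaining() < NUM_BYTES_FOR_BUCKETS) {  // the fragile check
      return false;
    }
    try {
      buf.put(new byte[NUM_BYTES_DENSE_STORAGE]);   // the actual dense write
      return true;
    } catch (BufferOverflowException e) {
      return false;
    }
  }

  // Buffer with exactly enough room: the check passes and the write fits.
  static boolean fits() {
    return writeDense(ByteBuffer.allocate(NUM_BYTES_DENSE_STORAGE));
  }

  // Advancing the position by 4 leaves remaining() == 1027: the check still
  // passes (1027 >= 1024) but the 1031-byte write overflows.
  static boolean overflows() {
    ByteBuffer buf = ByteBuffer.allocate(NUM_BYTES_DENSE_STORAGE);
    buf.position(4);
    return !writeDense(buf);
  }

  public static void main(String[] args) {
    System.out.println(fits());      // true
    System.out.println(overflows()); // true
  }
}
```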
|
@drcrallen @clintropolis
The returned duplicated read-only buffer is never captured or used.
The HLL sketch is a complex state machine, and if implemented correctly, should never allow itself to be placed in illegal states, especially from public methods. It is hard enough to validate that the state machine operates correctly, but placing it in illegal states and then expecting meaningful results is asking for trouble. You are not the only one to fall into this trap; clearly the author did too.
Nonetheless, the real issue to warn users about is to never, ever feed the sketch values produced with the same hash function and seed that the sketch itself uses. This goes back to what I have said before: the hashing operation should never have been exposed and delegated to the user. But that is history. |
|
|
@leerho Thanks for looking so deeply into this and for the detailed explanation! I am not very familiar with this area of the code and am definitely guilty of reviewing this change in a superficial manner, without verifying the details.
If I understand your analysis correctly, rather than changing nothing, the check would be better off comparing against the correct dense-storage size. Other thoughts:
I don't think this is actually a no-op. It sounds to me like there are too many design flaws in the hll collector as-is to make it worth fixing in terms of correctness, since fixing it sounds like it would break compatibility anyway, so I think we should probably focus on just trying to get users away from using it. But I do think it is worth addressing implementation-level things like this in some manner: either allowing design flaws to function as intended, however flawed they may be, or throwing sensible exceptions so users can at least know they have hit the limits of the capabilities of the algorithm, so issues like this don't appear as legitimate bugs in the implementation. |
|
Thank you for your thoughtful comments. I looked again at the “noop” line ... and you are right ... I missed it! But it is a convoluted statement that would have benefited from a code comment!

I also agree that throwing an exception would make a lot of sense. That may cause some of the tests that take advantage of the ability to fill the sketch with illegal sequences to fail. They would have to be rewritten. I’m not sure; you would have to try it and see. An ISE would be fine, as long as the message indicates that the user of the sketch has done something seriously wrong by feeding the sketch an illegal (astronomically unlikely) sequence.

This sketch implementation is highly vulnerable to accidental corruption because the default hash seed used by the aggregator is zero. And most people, when using hash functions, never bother with a seed. So anybody using Murmur3_128() in other parts of their code may inadvertently create correlation corruption effects with this sketch. And there is no warning! Unfortunately, I don’t know of any way to detect this or fix it in a compatible way WRT historically stored sketches. |
That is very true! But never underestimate the odd things that happen in production, where the things you thought were properly done turn out not to be so. |
Or have poor controls in for handling upstream data sampling. |
That is true as well, but I'm not trying to fix the hashing in druid in this PR. Just rectify one specific assumption (always dense) that is not guaranteed to be true for arbitrary data input (fails when input is improperly distributed). |
|
Hmm, so are we stuck here? I haven't had enough time to spare to dig deep enough into the hll collector to have a strong opinion of my own, so my main concern is that we don't fail with a BufferOverflowException. Whether that means a friendly error for a case that isn't supposed to happen, or simply handling a case that apparently can happen in practice under certain circumstances, seems to be the point of contention? Code-wise, the solution @drcrallen has doesn't look wrong to me, even if it's a design flaw that this could happen, but it sounds like the solution @leerho suggests is maybe the most correct in terms of .. correctness? Anyone else have any opinions here? |
|
Even if solutions were presented that fixed up items related to the core HLL aggregator itself, this part of the code would still carry the assumption that the buffer is densified. I do not have full coverage for all test cases where it may or may not have dense buffers at this code point, so there may in the future be other cases where it is "normal". The workflow assumes dense, so I added a check to rectify when that is not the case. Fixes to the overall HLL algorithm itself are beyond the scope of this PR, and presenting HLL results which "don't make good sense" because of bad input values is not solved by this PR. Meaning: if someone puts in values that do not encounter this error case, but still violate some of the (sparsely documented) design constraints of the HLL aggregator, then it will STILL present nonsense values even IF this particular code path fix is not hit. In other words, NOT hitting this error case does NOT mean the HLL approximation is GOOD, just that it didn't crash. Therefore I argue for keeping the code from crashing, thereby making ALL poorly constructed input values simply return whatever the algorithm can calculate. |
|
For longer term fixes, I'm not even sure if documenting the existing HLL design is worth it; it might be enough to simply say "Legacy HLL implementation has a lot of rough corner cases that are handled by its successor Data Sketches HLL" or similar. |
|
The problem with the sparseness checks is that sometimes the buffer is a read-only view with the headers, and sometimes it is a read-only view without the headers, and that changes which check is appropriate. The legacy of which one is used is very hard to track in the current design. The check as presented here is the correct check for this specific spot in the code. |
|
@drcrallen I spent some time last night trying to sift through the hll code to have a stronger opinion, and I think I agree with you. I suspect anyone that has been using this is maybe already conditioned to expect to have occasionally wonky results, the flaw here feels maybe more like a design flaw to me at this time, and I do think energy would be better spent getting people on datasketches hll instead of trying to fix this. I'm going to merge after CI so we can add to 0.14 👍 @leerho if you strongly disagree with this position, we can open an issue to track further fixes or modifications to this hll algorithm. |
The 0.14.0 docs will deprecate the old HLL agg and point users to the DataSketches HLL instead: #7195 I'm okay with this PR being merged for 0.14.0, I think "avoid crashes in old HLL, use datasketches HLL for proper results" is a reasonable approach |
* Densify swapped hll buffer
* Make test loop limit pre-increment
* Reformat
* Fix test comments
|
@drcrallen |
The legacy druid HLL module assumes the buffers are dense when doing a fold operation. This is not always the case and can lead to a BufferOverflowException. This PR checks to see if any densification needs to happen.
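The general shape of the fix can be illustrated with a toy fold. This is hypothetical code, not the Druid implementation: the register count, the sparse encoding as (index, value) pairs, and all names are invented for illustration. The point is simply to check whether the incoming image is sparse and densify it before folding, rather than assuming density.

```java
// Hypothetical, self-contained sketch of "densify before fold".
public class DensifyFoldDemo {
  static final int NUM_REGISTERS = 16;

  // Sparse image: (registerIndex, value) pairs; dense image: one byte per register.
  static byte[] densify(byte[] sparsePairs) {
    byte[] dense = new byte[NUM_REGISTERS];
    for (int i = 0; i < sparsePairs.length; i += 2) {
      dense[sparsePairs[i]] = sparsePairs[i + 1];
    }
    return dense;
  }

  // Toy heuristic: dense images are a fixed size; anything else is sparse.
  static boolean isSparse(byte[] image) {
    return image.length != NUM_REGISTERS;
  }

  // Fold other into acc, taking the max per register; densify first if needed
  // (the old code assumed other was always dense and would overflow otherwise).
  static void fold(byte[] acc, byte[] other) {
    byte[] dense = isSparse(other) ? densify(other) : other;
    for (int i = 0; i < NUM_REGISTERS; i++) {
      acc[i] = (byte) Math.max(acc[i], dense[i]);
    }
  }

  // Folding a sparse image must give the same result as folding its dense form.
  static boolean foldMatches() {
    byte[] sparse = {3, 5, 9, 2};  // register 3 -> 5, register 9 -> 2
    byte[] a = new byte[NUM_REGISTERS];
    byte[] b = new byte[NUM_REGISTERS];
    fold(a, sparse);
    fold(b, densify(sparse));
    for (int i = 0; i < NUM_REGISTERS; i++) {
      if (a[i] != b[i]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(foldMatches()); // true
  }
}
```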
We had an upstream data producer who was sampling data. The sampling algorithm seemed to be based on Murmur3_128, or at least a related algorithm where the hash collisions were similar. When doing a HLL sketch of the dimension values, we were getting really weird results where all the HLL buckets would end up with values that were not good sketches of the input data (every bucket nibble with a `1`, for example). `testCanFillUpOnMod` demonstrates such a scenario.

The unfortunate side effect of this was that the folding operation can easily cause corrupt buffers if the buffer folding in is sparse. `testRegisterSwapWithSparse` will fail against master at `folded.toByteBuffer()`, similar to how the jackson serialization of the collector fails on historicals in the error mode we found. With this PR applied, the query result does not crash, but does return a sketch that is useless, as demonstrated in the estimate-cardinality checks in the added unit tests.
A tangential long term solution here would probably be to also seed the murmur hash with a custom value... but that will break historical compatibility in nasty ways.