Add "stringEncoding" parameter to DataSketches HLL. #11201
gianm merged 17 commits into apache:master from
Conversation
Builds on the concept from apache#11172 and adds a way to feed HLL sketches with UTF-8 bytes.

This must be an option rather than always-on, because prior to this patch, HLL sketches used UTF-16LE encoding when hashing strings. To remain compatible with sketch images created prior to this patch -- which matters during rolling updates and when reading sketches that have been written to segments -- we must keep UTF-16LE as the default.

Not currently documented, because I'm not yet sure how best to expose this functionality to users. I think the first place would be in the SQL layer: we could have it automatically select UTF-8 or UTF-16LE when building sketches at query time. We need to be careful about this, though, because UTF-8 isn't always faster. Sometimes, like for the results of expressions, UTF-16LE is faster. I expect we will sort this out in future patches.
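The compatibility point is easiest to see at the byte level: the two encodings turn the same string into different byte sequences, so a sketch fed UTF-8 bytes and one fed UTF-16LE bytes assign different hashed identities to identical values. A minimal, stdlib-only illustration (not Druid code):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingIdentity
{
  public static void main(String[] args)
  {
    final String value = "abc";

    // UTF-8: one byte per ASCII character.
    final byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);

    // UTF-16LE: two bytes per character, low byte first.
    final byte[] utf16le = value.getBytes(StandardCharsets.UTF_16LE);

    System.out.println(Arrays.toString(utf8));     // [97, 98, 99]
    System.out.println(Arrays.toString(utf16le));  // [97, 0, 98, 0, 99, 0]

    // An HLL sketch hashes raw bytes, so the same string fed both ways
    // would be counted as two distinct values if mixed in one sketch.
    System.out.println(Arrays.equals(utf8, utf16le));  // false
  }
}
```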
This pull request introduces 1 alert when merging e27cc1c into 99f39c7 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging 492edd3 into 34169c8 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging 0d7d9aa into 34169c8 - view on LGTM.com new alerts:
@gianm
This is not correct, at least for the HLL in datasketches-java (I'm not sure what the Druid adaptor does). Strings are encoded using UTF-8 and have been for as long as I can remember. If you wish to use UTF-16, you just convert your string to char[] and the HLL sketch will accept that as well. The sketch really doesn't care what the string encoding is; it is either looking at the input as a stream of byte[] or char[]. The UTF-8 encoding was specified in the string update method to help users ensure consistency (if the string happened to be encoded in something else).

Nonetheless, whatever you decide, you will always need to stick with your choice. Otherwise, you will destroy the unique identity of whatever you are feeding the sketch, and as a result counts, merging, etc. will be meaningless!

I have some comments about PR 353 but I want to make those in the actual PR.

Lee.
@leerho Understood, but it is true as far as Druid is concerned — the HllSketch-based aggregator implementation in Druid does hash strings using UTF-16LE, since it passes them to the sketch as char[].
Yep, that's why this must be an option and the choice needs to be made in a consistent way.
Thanks!
/**
 * An enum that provides a way for users to specify what encoding should be used when hashing strings.
 *
 * The main reason for thsi setting's existence is getting the best performance possible. When operating on memory

Suggested change:
- * The main reason for thsi setting's existence is getting the best performance possible. When operating on memory
+ * The main reason for this setting's existence is getting the best performance possible. When operating on memory
@Override
public void aggregate(ByteBuffer buf, int numRows, int[] positions, @Nullable int[] rows, int positionOffset)
{
  final Object[] vector = selector.getObjectVector();
rows is not used in this function for finding the vector offset.
Ah, good catch, I'll figure out why the tests didn't catch it. I think we might need to use a filtered aggregator.
This makes me think it'd be good to have a standard aggregator test harness, like we have with BaseFilterTest for filters (that makes sure filters are exercised in all the ways they can be called).
Fixed and added a test case using filtered aggregators.
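For reference, the rows indirection being discussed looks roughly like this. The names and the summing payload are illustrative stand-ins, not the actual Druid aggregator code: when a filtered aggregator is applied, rows holds the subset of selected row indexes, so the value for slot i must be read via rows[i].

```java
public class RowsIndirection
{
  // Sketch of the vectorized-aggregation pattern: `rows` is null when all
  // rows are selected; otherwise it maps loop position i to a row index.
  static long sumSelected(long[] vector, int numRows, int[] rows)
  {
    long sum = 0;
    for (int i = 0; i < numRows; i++) {
      // Correct: honor the indirection when rows != null. The bug under
      // discussion was reading vector[i] unconditionally.
      final int idx = (rows != null) ? rows[i] : i;
      sum += vector[idx];
    }
    return sum;
  }

  public static void main(String[] args)
  {
    final long[] vector = {10, 20, 30, 40};
    // Unfiltered: all four rows.
    System.out.println(sumSelected(vector, 4, null));            // 100
    // Filtered: only rows 1 and 3 selected.
    System.out.println(sumSelected(vector, 2, new int[]{1, 3})); // 60
  }
}
```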
sketch,
stringEncoding,
selector,
vector[i]
similar comment here about rows
Fixed and added a test case using filtered aggregators.
Pushed this stuff:
This pull request introduces 1 alert when merging 53a708f into 44a7b09 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging f7ff8d4 into 44a7b09 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging dc373eb into 223c569 - view on LGTM.com new alerts:
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

This pull request/issue is no longer marked as stale.

Resolved conflicts with master.
This pull request introduces 1 alert when merging dacd05e into 7ab2170 - view on LGTM.com new alerts:
|
rohangarg left a comment
Thanks a lot for the changes @gianm! This would be very useful in all standard use cases for HLL queries. Possibly, the default distinct-count aggregator should also be moved to the HLL one in the future.
I think the first place would be in the SQL
layer: we could have it automatically select UTF-8 or UTF-16LE when
building sketches at query time. We need to be careful about this, though,
because UTF-8 isn't always faster. Sometimes, like for the results of
expressions, UTF-16LE is faster. I expect we will sort this out in
future patches.
Maybe the Druid expression for the sketch column could be inspected at planning time to determine the encoding to use (at least for the cases where the sketch is fully built at query time).
.appendString(fieldName)
.appendInt(lgK)
.appendInt(tgtHllType.ordinal())
.appendInt(stringEncoding.ordinal())
should we use stringEncoding.getCacheKey() instead?
Ahh, we should just do appendCacheable(stringEncoding). I'll change it.
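The appeal of getCacheKey()/appendCacheable over ordinal() is that an explicitly assigned byte id stays stable even if the enum's declaration order later changes, whereas ordinal() would silently produce a different cache key. A stdlib-only sketch of that idea, with illustrative names (not the actual Druid StringEncoding class):

```java
public class StableCacheKey
{
  // Illustrative stand-in for an enum that participates in cache keys:
  // each constant carries an explicit id, independent of declaration order.
  enum StringEncoding
  {
    UTF16LE((byte) 0),
    UTF8((byte) 1);

    private final byte id;

    StringEncoding(byte id)
    {
      this.id = id;
    }

    // Stable contribution to the cache key, unlike ordinal().
    byte[] getCacheKey()
    {
      return new byte[]{id};
    }
  }

  public static void main(String[] args)
  {
    // Reordering the constants would change ordinal() but not the explicit
    // id, so keys built from getCacheKey() survive enum reordering.
    System.out.println(StringEncoding.UTF8.ordinal());         // 1
    System.out.println(StringEncoding.UTF8.getCacheKey()[0]);  // 1
  }
}
```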
final int id
)
{
  if (stringEncoding == StringEncoding.UTF8 && selector.supportsLookupNameUtf8()) {
should we do this once per aggregator object itself? probably the branch predictor will optimize it eventually, but it may take time.
I think we need to rely on the branch predictor. If we lift this up to the aggregator itself, then it would still need to have a conditional when deciding which method to call in HllSketchBuildUtil, unless we make different aggregator classes for UTF8 vs UTF16LE, which seems overkill. LMK if I am missing something.
  }
} else {
  final String s = NullHandling.nullToEmptyIfNeeded(selector.lookupName(id));
  updateSketchWithString(sketch, stringEncoding, s);
should we add a !NullHandling.isNullOrEquivalent(asString) check as well? The earlier code seems to check that for all objects.
I think it doesn't matter, because HllSketch itself ignores both null and empty arrays / empty strings.
Given that, I will remove the call here to nullToEmptyIfNeeded, since it doesn't affect the outcome.
 * anyway so we can preserve enough information to ensure that we are merging sketches in a valid way. (Sketches with
 * incompatible string encodings cannot be merged meaningfully.) Currently, the only way this is exposed is through
 * {@link #getMergingFactory}, which will throw {@link AggregatorFactoryNotMergeableException} if presented with
 * two aggregators with two different encodings.
Currently, I think there is no way to detect if someone is accidentally merging sketches with different encodings at query time (say some sketches were ingested with encoding e1 and are then merged at query time with encoding e2 sketches). Probably, that would require augmented state in the aggregator serialization itself, which is again hard to do with respect to compatibility.
Yes, that's true, there's no real way to discover this. That's a big part of why the functionality is undocumented in this patch. I think we need to think a bit through how to expose it in a way that is safe for users. However, I do want to get it in, because having the functionality available will allow us to do further experimentation.
for (int i = 0; i < numRows; i++) {
  final int idx = rows != null ? rows[i] : i;
  if (NullHandling.replaceWithDefault() || nullVector == null || !nullVector[idx]) {
Having a separate loop for the NullHandling.replaceWithDefault() || nullVector == null case, which wouldn't need to check anything per row, might be faster. But it could be considered later.
I figured NullHandling.replaceWithDefault() always has the same value, so the JVM would be good at skipping it or at least making it efficient. Agree that it could be analyzed further to confirm this. I'd rather that be done in a follow up though.
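For illustration, the suggested loop split would look roughly like the following. Names and the summing payload are hypothetical, not the actual Druid code; the point is that the fast path runs without any per-row null check:

```java
public class NullCheckLoops
{
  // Two loops instead of one: when there is no null vector, the hot loop
  // carries no per-row null check at all.
  static long sumNonNull(long[] vector, int numRows, int[] rows, boolean[] nullVector)
  {
    long sum = 0;
    if (nullVector == null) {
      // Fast path: branch-free with respect to nulls.
      for (int i = 0; i < numRows; i++) {
        sum += vector[rows != null ? rows[i] : i];
      }
    } else {
      // Slow path: skip rows marked null.
      for (int i = 0; i < numRows; i++) {
        final int idx = rows != null ? rows[i] : i;
        if (!nullVector[idx]) {
          sum += vector[idx];
        }
      }
    }
    return sum;
  }

  public static void main(String[] args)
  {
    final long[] vector = {10, 20, 30};
    System.out.println(sumNonNull(vector, 3, null, null));                              // 60
    System.out.println(sumNonNull(vector, 3, null, new boolean[]{false, true, false})); // 40
  }
}
```

As noted above, the JVM may already make the single-loop version cheap, since the branch condition is constant per segment; whether the split actually helps would need measurement.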
  )
);
private static final List<AggregatorFactory> EXPECTED_PA_AGGREGATORS =
It refers to the test names where this is used, which have "postAggs" in them. I added a javadoc explaining what this is for.
{
  public static void updateSketch(final HllSketch sketch, final StringEncoding stringEncoding, final Object value)
  {
    if (value instanceof Integer || value instanceof Long) {
Check notice (Code scanning / CodeQL): Chain of 'instanceof' tests
 * exclusions (see spotbugs-exclude.xml).
 */
@SuppressWarnings({"EqualsAndHashcode", "EqualsHashCode", "EqualsWhichDoesntCheckParameterClass"})
public class StringEncodingDefaultUTF16LEJsonIncludeFilter // lgtm [java/inconsistent-equals-and-hashcode]
Check failure (Code scanning / CodeQL): Inconsistent equals and hashCode
All tests passed except
* Add "stringEncoding" parameter to DataSketches HLL.
* Fix benchmark.
* Fix style issues, improve test coverage.
* Put round back, to make IT updates easier.
* Fix test.
* Fix issue with filtered aggregators and add test.
* Use DS native update(ByteBuffer) method. Improve test coverage.
* Add another suppression.
* Fix ITAutoCompactionTest.
* Update benchmarks.
* Updates.
* Fix conflict.
* Adjustments.
To enable experimentation, I added a currently-undocumented SQL aggregator, APPROX_COUNT_DISTINCT_DS_HLL_UTF8. Hopefully we will be able to fold this into APPROX_COUNT_DISTINCT_DS_HLL in a way that is safe for users.
Benchmarks
These are benchmarks of various approx count distinct functions on a 5,000,000 row segment. The values that appear are generally short (a few characters long).
Testing strategy
Added specific sketch image expectations to HllSketchAggregatorTest and manually verified them against master, to ensure consistency across versions.

Took this out, since something changed since this PR was originally created such that sketch images are not reliably consistent across different query engines (timeseries, groupBy v1, groupBy v2). I'm not sure why not. However, I figured these specific-sketch-image tests are too brittle anyway and therefore shouldn't be included.