
Add "stringEncoding" parameter to DataSketches HLL.#11201

Merged
gianm merged 17 commits into apache:master from gianm:query-datasketches-hll-utf8
Jun 30, 2023

Conversation

@gianm
Contributor

@gianm gianm commented May 5, 2021

Builds on the concept from #11172 and adds a way to feed HLL sketches
with UTF-8 bytes.

This must be an option rather than always-on, because prior to this
patch, HLL sketches used UTF-16LE encoding when hashing strings. To
remain compatible with sketch images created prior to this patch -- which
matters during rolling updates and when reading sketches that have been
written to segments -- we must keep UTF-16LE as the default.
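To make the incompatibility concrete, here is a minimal, self-contained JDK sketch (illustrative only, not Druid code): the same string produces different byte streams under the two encodings, so a sketch fed UTF-8 bytes and a sketch fed UTF-16LE bytes hash identical values to different buckets and cannot be meaningfully merged.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingCompat
{
  public static void main(String[] args)
  {
    String s = "druid";

    // The same logical string, encoded two ways:
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    byte[] utf16le = s.getBytes(StandardCharsets.UTF_16LE);

    System.out.println(utf8.length);                  // 5 bytes: one per ASCII char
    System.out.println(utf16le.length);               // 10 bytes: two per char
    System.out.println(Arrays.equals(utf8, utf16le)); // false: different hash inputs
  }
}
```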

Not currently documented, because I'm not yet sure how best to expose
this functionality to users. I think the first place would be in the SQL
layer: we could have it automatically select UTF-8 or UTF-16LE when
building sketches at query time. We need to be careful about this, though,
because UTF-8 isn't always faster. Sometimes, like for the results of
expressions, UTF-16LE is faster. I expect we will sort this out in
future patches.

To enable experimentation, I added a currently-undocumented SQL
aggregator, APPROX_COUNT_DISTINCT_DS_HLL_UTF8. Hopefully
we will be able to fold this into APPROX_COUNT_DISTINCT_DS_HLL
in a way that is safe for users.

Benchmarks

These are benchmarks of various approx count distinct functions on a 5,000,000 row segment. The values that appear are generally short (a few characters long).

APPROX_COUNT_DISTINCT_BUILTIN

SqlBenchmark.querySql       28           5000000           mmap              none        force  avgt    5  732.455 ± 120.374  ms/op
SqlBenchmark.querySql       28           5000000           mmap     front-coded-4        force  avgt    5  561.300 ±  20.598  ms/op

APPROX_COUNT_DISTINCT_DS_HLL

SqlBenchmark.querySql       29           5000000           mmap              none        force  avgt    5  434.056 ±  16.938  ms/op
SqlBenchmark.querySql       29           5000000           mmap     front-coded-4        force  avgt    5  450.413 ±  17.266  ms/op

APPROX_COUNT_DISTINCT_DS_HLL_UTF8

SqlBenchmark.querySql       30           5000000           mmap              none        force  avgt    5  180.603 ±   3.647  ms/op
SqlBenchmark.querySql       30           5000000           mmap     front-coded-4        force  avgt    5  188.436 ±   2.007  ms/op

APPROX_COUNT_DISTINCT_DS_THETA

SqlBenchmark.querySql       31           5000000           mmap              none        force  avgt    5  457.788 ± 107.166  ms/op
SqlBenchmark.querySql       31           5000000           mmap     front-coded-4        force  avgt    5  468.164 ±  29.853  ms/op

Testing strategy

  • Extended HllSketchAggregatorTest to run on a matrix of (vectorized, nonvectorized) x (utf8, utf16le).
  • Added specific sketch image expectations to HllSketchAggregatorTest and manually verified them against master, to ensure consistency across versions. Took this out, since something changed after this PR was originally created and sketch images are no longer reliably consistent across query engines (timeseries, groupBy v1, groupBy v2); I'm not sure why. However, these specific-sketch-image tests are too brittle anyway, so they shouldn't be included.
  • Added serde tests to make sure the new field JSON-round-trips properly.

@lgtm-com

lgtm-com Bot commented May 5, 2021

This pull request introduces 1 alert when merging e27cc1c into 99f39c7 - view on LGTM.com

new alerts:

  • 1 for Inconsistent equals and hashCode

@lgtm-com

lgtm-com Bot commented May 5, 2021

This pull request introduces 1 alert when merging 492edd3 into 34169c8 - view on LGTM.com

new alerts:

  • 1 for Inconsistent equals and hashCode

@lgtm-com

lgtm-com Bot commented May 6, 2021

This pull request introduces 1 alert when merging 0d7d9aa into 34169c8 - view on LGTM.com

new alerts:

  • 1 for Inconsistent equals and hashCode

@leerho
Member

leerho commented May 7, 2021

@gianm
I want to clarify a comment made above.

HLL sketches used UTF-16LE encoding when hashing strings.

This is not correct, at least for the HLL in datasketches-java (I'm not sure what the Druid adaptor does). Strings are encoded using UTF-8 and have been for as long as I can remember. If you wish to use UTF-16, you just convert your string to char[] and the HLL sketch will accept that as well. The sketch really doesn't care what the string encoding is; it is either looking at the input as a stream of byte[] or char[]. The UTF-8 encoding was specified in the string update method to help users ensure consistency (in case the string happened to be encoded in something else). Nonetheless, whatever you decide, you will always need to stick with your choice. Otherwise, you will destroy the unique identity of whatever you are feeding the sketch. As a result, counts, merging, etc. will be meaningless!

I have some comments about PR 353 but I want to make these in the actual PR.

Lee.

@gianm
Contributor Author

gianm commented May 10, 2021

This is not correct, at least for the HLL in datasketches-java (I'm not sure what the Druid adaptor does). Strings are encoded using UTF-8 and have been for as long as I can remember. If you wish to use UTF-16, you just convert your string to char[] and the HLL sketch will accept that as well.

@leerho Understood, but it is true as far as Druid is concerned: the HllSketch-based aggregator implementation in Druid does update(s.toCharArray()), not update(s).
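As a plain-JDK illustration (not the Druid aggregator itself): feeding a sketch s.toCharArray() is equivalent to feeding it the UTF-16LE bytes of the string, since Java chars are UTF-16 code units.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharArrayVsUtf8
{
  public static void main(String[] args)
  {
    String s = "druid";
    char[] chars = s.toCharArray();

    // Serialize the char[] exactly as the sketch would consume it:
    // each char is a 16-bit code unit, written little-endian.
    ByteBuffer buf = ByteBuffer.allocate(chars.length * 2)
                               .order(ByteOrder.LITTLE_ENDIAN);
    for (char c : chars) {
      buf.putChar(c);
    }

    byte[] fromChars = buf.array();
    byte[] utf16le = s.getBytes(StandardCharsets.UTF_16LE);

    // The char[] stream is byte-for-byte the UTF-16LE encoding,
    // which is why update(s.toCharArray()) implies UTF-16LE hashing.
    System.out.println(Arrays.equals(fromChars, utf16le)); // true
  }
}
```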

Nonetheless, whatever you decide, you will always need to stick with your choice.

Yep, that's why this must be an option and the choice needs to be made in a consistent way.

I have some comments about PR 353 but I want to make these in the actual PR.

Thanks!

/**
* An enum that provides a way for users to specify what encoding should be used when hashing strings.
*
* The main reason for thsi setting's existence is getting the best performance possible. When operating on memory
Contributor

Suggested change
* The main reason for thsi setting's existence is getting the best performance possible. When operating on memory
* The main reason for this setting's existence is getting the best performance possible. When operating on memory

@Override
public void aggregate(ByteBuffer buf, int numRows, int[] positions, @Nullable int[] rows, int positionOffset)
{
final Object[] vector = selector.getObjectVector();
Contributor

rows is not used in this function for finding the vector offset.

Contributor Author

Ah, good catch, I'll figure out why the tests didn't catch it. I think we might need to use a filtered aggregator.

Contributor Author

This makes me think it'd be good to have a standard aggregator test harness, like we have with BaseFilterTest for filters (that makes sure filters are exercised in all the ways they can be called).

Contributor Author

Fixed and added a test case using filtered aggregators.

sketch,
stringEncoding,
selector,
vector[i]
Contributor

similar comment here about rows

Contributor Author

Fixed and added a test case using filtered aggregators.

@gianm
Contributor Author

gianm commented Oct 23, 2021

Pushed this stuff:

@lgtm-com

lgtm-com Bot commented Oct 23, 2021

This pull request introduces 1 alert when merging 53a708f into 44a7b09 - view on LGTM.com

new alerts:

  • 1 for Inconsistent equals and hashCode

@lgtm-com

lgtm-com Bot commented Oct 24, 2021

This pull request introduces 1 alert when merging f7ff8d4 into 44a7b09 - view on LGTM.com

new alerts:

  • 1 for Inconsistent equals and hashCode

@lgtm-com

lgtm-com Bot commented Nov 11, 2021

This pull request introduces 1 alert when merging dc373eb into 223c569 - view on LGTM.com

new alerts:

  • 1 for Inconsistent equals and hashCode

@stale

stale Bot commented Apr 19, 2022

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale Bot added the stale label Apr 19, 2022
@stale

stale Bot commented May 14, 2022

This pull request/issue is no longer marked as stale.

@stale stale Bot removed the stale label May 14, 2022
@gianm
Contributor Author

gianm commented May 14, 2022

Resolved conflicts with master.

@lgtm-com

lgtm-com Bot commented May 15, 2022

This pull request introduces 1 alert when merging dacd05e into 7ab2170 - view on LGTM.com

new alerts:

  • 1 for Inconsistent equals and hashCode

Member

@rohangarg rohangarg left a comment

Thanks a lot for the changes @gianm! This would be very useful in all standard use cases for HLL queries. Possibly, the default distinct aggregator should also be moved to the HLL one in the future.

I think the first place would be in the SQL
layer: we could have it automatically select UTF-8 or UTF-16LE when
building sketches at query time. We need to be careful about this, though,
because UTF-8 isn't always faster. Sometimes, like for the results of
expressions, UTF-16LE is faster. I expect we will sort this out in
future patches.

Maybe the Druid expression for the sketch column could be inspected at planning time to determine the encoding to use (at least for the cases when the sketch is fully built at query time).

.appendString(fieldName)
.appendInt(lgK)
.appendInt(tgtHllType.ordinal())
.appendInt(stringEncoding.ordinal())
Member

should we use stringEncoding.getCacheKey() instead?

Contributor Author

Ahh, we should just do appendCacheable(stringEncoding). I'll change it.
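For background, a hedged sketch of why a dedicated cache-key byte is preferable to ordinal() (the enum and the byte values below are hypothetical stand-ins, not Druid's actual StringEncoding or CacheKeyBuilder): an explicitly assigned byte stays stable even if enum constants are later reordered or inserted, whereas ordinal() silently changes.

```java
import java.nio.ByteBuffer;

public class CacheKeyExample
{
  // Hypothetical stand-in; the real Druid enum implements a Cacheable
  // interface whose getCacheKey() returns a stable byte sequence.
  enum StringEncoding
  {
    UTF16LE((byte) 0),
    UTF8((byte) 1);

    private final byte cacheKey;

    StringEncoding(byte cacheKey)
    {
      this.cacheKey = cacheKey;
    }

    byte[] getCacheKey()
    {
      // Explicitly assigned: unlike ordinal(), this does not change
      // if constants are reordered, so cached results stay valid.
      return new byte[]{cacheKey};
    }
  }

  public static void main(String[] args)
  {
    byte[] key = ByteBuffer.allocate(2)
                           .put((byte) 0x2E) // hypothetical aggregator type id
                           .put(StringEncoding.UTF8.getCacheKey())
                           .array();
    System.out.println(key.length); // 2
  }
}
```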

final int id
)
{
if (stringEncoding == StringEncoding.UTF8 && selector.supportsLookupNameUtf8()) {
Member

should we do this once per aggregator object itself? probably the branch predictor will optimize it eventually, but it may take time.

Contributor Author

I think we need to rely on the branch predictor. If we lift this up to the aggregator itself, then it would still need to have a conditional when deciding which method to call in HllSketchBuildUtil, unless we make different aggregator classes for UTF8 vs UTF16LE, which seems overkill. LMK if I am missing something.

}
} else {
final String s = NullHandling.nullToEmptyIfNeeded(selector.lookupName(id));
updateSketchWithString(sketch, stringEncoding, s);
Member

should we add a !NullHandling.isNullOrEquivalent(asString) check as well? The earlier code seems to check that for all objects.

Contributor Author

I think it doesn't matter, because HllSketch itself ignores both null and empty arrays / empty strings.

Given that, I will remove the call here to nullToEmptyIfNeeded, since it doesn't affect the outcome.

* anyway so we can preserve enough information to ensure that we are merging sketches in a valid way. (Sketches with
* incompatible string encodings cannot be merged meaningfully.) Currently, the only way this is exposed is through
* {@link #getMergingFactory}, which will throw {@link AggregatorFactoryNotMergeableException} if presented with
* two aggregators with two different encodings.
Member

Currently, I think there is no way to detect if someone is accidentally merging sketches with different encodings at query time (say some sketches were ingested with encoding e1 and at query time they got merged with encoding e2 sketches). Probably, that would require augmented state in the aggregator serialization itself, which is again hard to do with respect to compatibility.

Contributor Author

Yes, that's true, there's no real way to discover this. That's a big part of why the functionality is undocumented in this patch. I think we need to think a bit through how to expose it in a way that is safe for users. However, I do want to get it in, because having the functionality available will allow us to do further experimentation.


for (int i = 0; i < numRows; i++) {
final int idx = rows != null ? rows[i] : i;
if (NullHandling.replaceWithDefault() || nullVector == null || !nullVector[idx]) {
Member

having a separate loop for the NullHandling.replaceWithDefault() || nullVector == null case, which doesn't check anything per row, might be faster. But it could be considered later.

Contributor Author

I figured NullHandling.replaceWithDefault() always has the same value, so the JVM would be good at skipping it or at least making it efficient. Agree that it could be analyzed further to confirm this. I'd rather that be done in a follow up though.

)
);

private static final List<AggregatorFactory> EXPECTED_PA_AGGREGATORS =
Member

what does PA signify here?

Contributor Author

It refers to the test names where this is used, which have "postAggs" in them. I added a javadoc explaining what this is for.

{
public static void updateSketch(final HllSketch sketch, final StringEncoding stringEncoding, final Object value)
{
if (value instanceof Integer || value instanceof Long) {

Check notice

Code scanning / CodeQL

Chain of 'instanceof' tests

This if block performs a chain of 6 type tests - consider alternatives, e.g. polymorphism or the visitor pattern.
* exclusions (see spotbugs-exclude.xml).
*/
@SuppressWarnings({"EqualsAndHashcode", "EqualsHashCode", "EqualsWhichDoesntCheckParameterClass"})
public class StringEncodingDefaultUTF16LEJsonIncludeFilter // lgtm [java/inconsistent-equals-and-hashcode]

Check failure

Code scanning / CodeQL

Inconsistent equals and hashCode

Class StringEncodingDefaultUTF16LEJsonIncludeFilter overrides [equals](1) but not hashCode.
Member

@clintropolis clintropolis left a comment

latest changes lgtm 👍

@gianm
Contributor Author

gianm commented Jun 30, 2023

All tests passed except standard-its / (Compile=openjdk8, Run=openjdk8, Cluster Build On K8s) ITNestedQueryPushDownTest integration test, which has been weird in other PRs as well, and I don't believe would be related to the code in this patch. So, I'll merge it.

@gianm gianm merged commit 67fbd8e into apache:master Jun 30, 2023
@gianm gianm deleted the query-datasketches-hll-utf8 branch June 30, 2023 19:45
@abhishekagarwal87 abhishekagarwal87 added this to the 27.0 milestone Jul 19, 2023
sergioferragut pushed a commit to sergioferragut/druid that referenced this pull request Jul 21, 2023
* Add "stringEncoding" parameter to DataSketches HLL.

* Fix benchmark.

* Fix style issues, improve test coverage.

* Put round back, to make IT updates easier.

* Fix test.

* Fix issue with filtered aggregators and add test.

* Use DS native update(ByteBuffer) method. Improve test coverage.

* Add another suppression.

* Fix ITAutoCompactionTest.

* Update benchmarks.

* Updates.

* Fix conflict.

* Adjustments.