
Add "stringEncoding" parameter to DataSketches HLL.#11201

Merged
gianm merged 17 commits into apache:master from gianm:query-datasketches-hll-utf8
Jun 30, 2023

Conversation

@gianm
Contributor

@gianm gianm commented May 5, 2021

Builds on the concept from #11172 and adds a way to feed HLL sketches
with UTF-8 bytes.

This must be an option rather than always-on, because prior to this
patch, HLL sketches used UTF-16LE encoding when hashing strings. To
remain compatible with sketch images created prior to this patch -- which
matters during rolling updates and when reading sketches that have been
written to segments -- we must keep UTF-16LE as the default.
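To make the incompatibility concrete, here is a minimal, self-contained JDK sketch (illustrative only, not Druid code): the same string produces different byte streams under the two encodings, so a sketch fed UTF-8 bytes and a sketch fed UTF-16LE bytes hash identical values to different buckets and cannot be meaningfully merged.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingCompat
{
  public static void main(String[] args)
  {
    String s = "druid";

    // The same logical string, encoded two ways:
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    byte[] utf16le = s.getBytes(StandardCharsets.UTF_16LE);

    System.out.println(utf8.length);                  // 5 bytes: one per ASCII char
    System.out.println(utf16le.length);               // 10 bytes: two per char
    System.out.println(Arrays.equals(utf8, utf16le)); // false: different hash inputs
  }
}
```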

Not currently documented, because I'm not yet sure how best to expose
this functionality to users. I think the first place would be in the SQL
layer: we could have it automatically select UTF-8 or UTF-16LE when
building sketches at query time. We need to be careful about this, though,
because UTF-8 isn't always faster. Sometimes, like for the results of
expressions, UTF-16LE is faster. I expect we will sort this out in
future patches.

To enable experimentation, I added a currently-undocumented SQL
aggregator, APPROX_COUNT_DISTINCT_DS_HLL_UTF8. Hopefully
we will be able to fold this into APPROX_COUNT_DISTINCT_DS_HLL
in a way that is safe for users.

Benchmarks

These are benchmarks of various approx count distinct functions on a 5,000,000 row segment. The values that appear are generally short (a few characters long).

APPROX_COUNT_DISTINCT_BUILTIN

SqlBenchmark.querySql       28           5000000           mmap              none        force  avgt    5  732.455 ± 120.374  ms/op
SqlBenchmark.querySql       28           5000000           mmap     front-coded-4        force  avgt    5  561.300 ±  20.598  ms/op

APPROX_COUNT_DISTINCT_DS_HLL

SqlBenchmark.querySql       29           5000000           mmap              none        force  avgt    5  434.056 ±  16.938  ms/op
SqlBenchmark.querySql       29           5000000           mmap     front-coded-4        force  avgt    5  450.413 ±  17.266  ms/op

APPROX_COUNT_DISTINCT_DS_HLL_UTF8

SqlBenchmark.querySql       30           5000000           mmap              none        force  avgt    5  180.603 ±   3.647  ms/op
SqlBenchmark.querySql       30           5000000           mmap     front-coded-4        force  avgt    5  188.436 ±   2.007  ms/op

APPROX_COUNT_DISTINCT_DS_THETA

SqlBenchmark.querySql       31           5000000           mmap              none        force  avgt    5  457.788 ± 107.166  ms/op
SqlBenchmark.querySql       31           5000000           mmap     front-coded-4        force  avgt    5  468.164 ±  29.853  ms/op

Testing strategy

  • Extended HllSketchAggregatorTest to run on a matrix of (vectorized, nonvectorized) x (utf8, utf16le).
  • Added specific sketch image expectations to HllSketchAggregatorTest and manually verified them against master, to ensure consistency across versions. Took this out, since something changed after this PR was originally created and sketch images are no longer reliably consistent across query engines (timeseries, groupBy v1, groupBy v2); I'm not sure why. However, these specific-sketch-image tests are too brittle anyway, so they shouldn't be included.
  • Added serde tests to make sure the new field JSON-round-trips properly.

@lgtm-com

lgtm-com Bot commented May 5, 2021

This pull request introduces 1 alert when merging e27cc1c into 99f39c7 - view on LGTM.com

new alerts:

  • 1 for Inconsistent equals and hashCode

@lgtm-com

lgtm-com Bot commented May 5, 2021

This pull request introduces 1 alert when merging 492edd3 into 34169c8 - view on LGTM.com

new alerts:

  • 1 for Inconsistent equals and hashCode

@lgtm-com

lgtm-com Bot commented May 6, 2021

This pull request introduces 1 alert when merging 0d7d9aa into 34169c8 - view on LGTM.com

new alerts:

  • 1 for Inconsistent equals and hashCode

@leerho
Member

leerho commented May 7, 2021

@gianm
I want to clarify a comment made above.

HLL sketches used UTF-16LE encoding when hashing strings.

This is not correct, at least for the HLL in datasketches-java (I'm not sure what the Druid adaptor does). Strings are encoded using UTF-8 and have been for as long as I can remember. If you wish to use UTF-16, you just convert your string to char[] and the HLL sketch will accept that as well. The sketch really doesn't care what the string encoding is; it is either looking at the input as a stream of byte[] or char[]. The UTF-8 encoding was specified in the string update method to help users ensure consistency (in case the string happened to be encoded in something else). Nonetheless, whatever you decide, you will always need to stick with your choice. Otherwise, you will destroy the unique identity of whatever you are feeding the sketch. As a result, counts, merging, etc. will be meaningless!

I have some comments about PR 353 but I want to make these in the actual PR.

Lee.

@gianm
Contributor Author

gianm commented May 10, 2021

This is not correct, at least for the HLL in datasketches-java (I'm not sure what the Druid adaptor does). Strings are encoded using UTF-8 and have been for as long as I can remember. If you wish to use UTF-16, you just convert your string to char[] and the HLL sketch will accept that as well.

@leerho Understood, but it is true as far as Druid is concerned: the HllSketch-based aggregator implementation in Druid does update(s.toCharArray()), not update(s).
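As a plain-JDK illustration (not the Druid aggregator itself): feeding a sketch s.toCharArray() is equivalent to feeding it the UTF-16LE bytes of the string, since Java chars are UTF-16 code units.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharArrayVsUtf8
{
  public static void main(String[] args)
  {
    String s = "druid";
    char[] chars = s.toCharArray();

    // Serialize the char[] exactly as the sketch would consume it:
    // each char is a 16-bit code unit, written little-endian.
    ByteBuffer buf = ByteBuffer.allocate(chars.length * 2)
                               .order(ByteOrder.LITTLE_ENDIAN);
    for (char c : chars) {
      buf.putChar(c);
    }

    byte[] fromChars = buf.array();
    byte[] utf16le = s.getBytes(StandardCharsets.UTF_16LE);

    // The char[] stream is byte-for-byte the UTF-16LE encoding,
    // which is why update(s.toCharArray()) implies UTF-16LE hashing.
    System.out.println(Arrays.equals(fromChars, utf16le)); // true
  }
}
```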

Nonetheless, whatever you decide, you will always need to stick with your choice.

Yep, that's why this must be an option and the choice needs to be made in a consistent way.

I have some comments about PR 353 but I want to make these in the actual PR.

Thanks!

/**
* An enum that provides a way for users to specify what encoding should be used when hashing strings.
*
* The main reason for thsi setting's existence is getting the best performance possible. When operating on memory
Contributor

Suggested change
* The main reason for thsi setting's existence is getting the best performance possible. When operating on memory
* The main reason for this setting's existence is getting the best performance possible. When operating on memory

@Override
public void aggregate(ByteBuffer buf, int numRows, int[] positions, @Nullable int[] rows, int positionOffset)
{
final Object[] vector = selector.getObjectVector();
Contributor

rows is not used in this function for finding the vector offset.

Contributor Author

Ah, good catch, I'll figure out why the tests didn't catch it. I think we might need to use a filtered aggregator.

Contributor Author

This makes me think it'd be good to have a standard aggregator test harness, like we have with BaseFilterTest for filters (that makes sure filters are exercised in all the ways they can be called).

Contributor Author

Fixed and added a test case using filtered aggregators.

sketch,
stringEncoding,
selector,
vector[i]
Contributor

similar comment here about rows

Contributor Author

Fixed and added a test case using filtered aggregators.

@gianm
Contributor Author

gianm commented Oct 23, 2021

Pushed this stuff:

@lgtm-com

lgtm-com Bot commented Oct 23, 2021

This pull request introduces 1 alert when merging 53a708f into 44a7b09 - view on LGTM.com

new alerts:

  • 1 for Inconsistent equals and hashCode

@lgtm-com

lgtm-com Bot commented Oct 24, 2021

This pull request introduces 1 alert when merging f7ff8d4 into 44a7b09 - view on LGTM.com

new alerts:

  • 1 for Inconsistent equals and hashCode

@lgtm-com

lgtm-com Bot commented Nov 11, 2021

This pull request introduces 1 alert when merging dc373eb into 223c569 - view on LGTM.com

new alerts:

  • 1 for Inconsistent equals and hashCode

@stale

stale Bot commented Apr 19, 2022

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale Bot added the stale label Apr 19, 2022
@stale

stale Bot commented May 14, 2022

This pull request/issue is no longer marked as stale.

@stale stale Bot removed the stale label May 14, 2022
@gianm
Contributor Author

gianm commented May 14, 2022

Resolved conflicts with master.

@lgtm-com

lgtm-com Bot commented May 15, 2022

This pull request introduces 1 alert when merging dacd05e into 7ab2170 - view on LGTM.com

new alerts:

  • 1 for Inconsistent equals and hashCode

Member

@rohangarg rohangarg left a comment

Thanks a lot for the changes @gianm! This would be very useful in all standard use cases for HLL queries. Possibly, the default distinct aggregator should also be moved to the HLL one in the future.

I think the first place would be in the SQL
layer: we could have it automatically select UTF-8 or UTF-16LE when
building sketches at query time. We need to be careful about this, though,
because UTF-8 isn't always faster. Sometimes, like for the results of
expressions, UTF-16LE is faster. I expect we will sort this out in
future patches.

Maybe the Druid expression for the sketch column could be inspected at planning time to determine the encoding to use (at least for the cases when the sketch is fully built at query time).

.appendString(fieldName)
.appendInt(lgK)
.appendInt(tgtHllType.ordinal())
.appendInt(stringEncoding.ordinal())
Member

should we use stringEncoding.getCacheKey() instead?

Contributor Author

Ahh, we should just do appendCacheable(stringEncoding). I'll change it.
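For background, a hedged sketch of why a dedicated cache-key byte is preferable to ordinal() (the enum and the byte values below are hypothetical stand-ins, not Druid's actual StringEncoding or CacheKeyBuilder): an explicitly assigned byte stays stable even if enum constants are later reordered or inserted, whereas ordinal() silently changes.

```java
import java.nio.ByteBuffer;

public class CacheKeyExample
{
  // Hypothetical stand-in; the real Druid enum implements a Cacheable
  // interface whose getCacheKey() returns a stable byte sequence.
  enum StringEncoding
  {
    UTF16LE((byte) 0),
    UTF8((byte) 1);

    private final byte cacheKey;

    StringEncoding(byte cacheKey)
    {
      this.cacheKey = cacheKey;
    }

    byte[] getCacheKey()
    {
      // Explicitly assigned: unlike ordinal(), this does not change
      // if constants are reordered, so cached results stay valid.
      return new byte[]{cacheKey};
    }
  }

  public static void main(String[] args)
  {
    byte[] key = ByteBuffer.allocate(2)
                           .put((byte) 0x2E) // hypothetical aggregator type id
                           .put(StringEncoding.UTF8.getCacheKey())
                           .array();
    System.out.println(key.length); // 2
  }
}
```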

final int id
)
{
if (stringEncoding == StringEncoding.UTF8 && selector.supportsLookupNameUtf8()) {
Member

should we do this once per aggregator object itself? probably the branch predictor will optimize it eventually, but it may take time.

Contributor Author

I think we need to rely on the branch predictor. If we lift this up to the aggregator itself, then it would still need to have a conditional when deciding which method to call in HllSketchBuildUtil, unless we make different aggregator classes for UTF8 vs UTF16LE, which seems overkill. LMK if I am missing something.

}
} else {
final String s = NullHandling.nullToEmptyIfNeeded(selector.lookupName(id));
updateSketchWithString(sketch, stringEncoding, s);
Member

should we add a !NullHandling.isNullOrEquivalent(asString) check as well? The earlier code seems to check that for all objects.

Contributor Author

I think it doesn't matter, because HllSketch itself ignores both null and empty arrays / empty strings.

Given that, I will remove the call here to nullToEmptyIfNeeded, since it doesn't affect the outcome.

* anyway so we can preserve enough information to ensure that we are merging sketches in a valid way. (Sketches with
* incompatible string encodings cannot be merged meaningfully.) Currently, the only way this is exposed is through
* {@link #getMergingFactory}, which will throw {@link AggregatorFactoryNotMergeableException} if presented with
* two aggregators with two different encodings.
Member

Currently, I think there is no way to detect if someone is accidentally merging sketches with different encodings at query time (say some sketches were ingested with encoding e1 and at query time they got merged with encoding e2 sketches). Probably, that would require augmented state in the aggregator serialization itself, which is again hard to do with respect to compatibility.

Contributor Author

Yes, that's true, there's no real way to discover this. That's a big part of why the functionality is undocumented in this patch. I think we need to think a bit through how to expose it in a way that is safe for users. However, I do want to get it in, because having the functionality available will allow us to do further experimentation.


for (int i = 0; i < numRows; i++) {
final int idx = rows != null ? rows[i] : i;
if (NullHandling.replaceWithDefault() || nullVector == null || !nullVector[idx]) {
Member

having a separate loop for the NullHandling.replaceWithDefault() || nullVector == null case, which doesn't check anything per row, might be faster. But it could be considered later.

Contributor Author

I figured NullHandling.replaceWithDefault() always has the same value, so the JVM would be good at skipping it or at least making it efficient. Agree that it could be analyzed further to confirm this. I'd rather that be done in a follow up though.

)
);

private static final List<AggregatorFactory> EXPECTED_PA_AGGREGATORS =
Member

what does PA signify here?

Contributor Author

It refers to the test names where this is used, which have "postAggs" in them. I added a javadoc explaining what this is for.

{
public static void updateSketch(final HllSketch sketch, final StringEncoding stringEncoding, final Object value)
{
if (value instanceof Integer || value instanceof Long) {

Check notice

Code scanning / CodeQL

Chain of 'instanceof' tests

This if block performs a chain of 6 type tests - consider alternatives, e.g. polymorphism or the visitor pattern.
* exclusions (see spotbugs-exclude.xml).
*/
@SuppressWarnings({"EqualsAndHashcode", "EqualsHashCode", "EqualsWhichDoesntCheckParameterClass"})
public class StringEncodingDefaultUTF16LEJsonIncludeFilter // lgtm [java/inconsistent-equals-and-hashcode]

Check failure

Code scanning / CodeQL

Inconsistent equals and hashCode

Class StringEncodingDefaultUTF16LEJsonIncludeFilter overrides [equals](1) but not hashCode.
Member

@clintropolis clintropolis left a comment

latest changes lgtm 👍

@gianm
Contributor Author

gianm commented Jun 30, 2023

All tests passed except standard-its / (Compile=openjdk8, Run=openjdk8, Cluster Build On K8s) ITNestedQueryPushDownTest integration test, which has been weird in other PRs as well, and I don't believe would be related to the code in this patch. So, I'll merge it.

@gianm gianm merged commit 67fbd8e into apache:master Jun 30, 2023
@gianm gianm deleted the query-datasketches-hll-utf8 branch June 30, 2023 19:45
@abhishekagarwal87 abhishekagarwal87 added this to the 27.0 milestone Jul 19, 2023
sergioferragut pushed a commit to sergioferragut/druid that referenced this pull request Jul 21, 2023
* Add "stringEncoding" parameter to DataSketches HLL.

* Fix benchmark.

* Fix style issues, improve test coverage.

* Put round back, to make IT updates easier.

* Fix test.

* Fix issue with filtered aggregators and add test.

* Use DS native update(ByteBuffer) method. Improve test coverage.

* Add another suppression.

* Fix ITAutoCompactionTest.

* Update benchmarks.

* Updates.

* Fix conflict.

* Adjustments.