Add a way to retrieve UTF-8 bytes directly via DimensionDictionarySelector. by gianm · Pull Request #11172 · apache/druid

gianm · 2021-04-27T19:31:46Z

The idea is that certain operations (like count distinct on strings) will
be faster if they are able to run directly on UTF-8 bytes instead of on
Java Strings decoded by "lookupName".

I'm looking into modifying the cardinality and datasketch hll build
aggregators to use lookupNameUtf8 instead of lookupName in a
follow-on patch, and it is able to speed them up quite a bit.
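
To illustrate the idea, here is a minimal sketch of counting distinct values by hashing UTF-8 bytes straight from a dictionary, without decoding to a Java String first. Everything in it is an assumption made for illustration: `Utf8Dictionary` and `ArrayUtf8Dictionary` are hypothetical stand-ins (not Druid's actual selector interface), and a hash set with FNV-1a stands in for a real sketch such as HLL, so the count is exact only up to hash collisions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a dictionary-backed selector; not Druid's real interface.
interface Utf8Dictionary {
    // Returns the UTF-8 bytes stored for a dictionary id.
    ByteBuffer lookupNameUtf8(int id);

    // Decoding path: allocates a new String for every lookup.
    default String lookupName(int id) {
        return StandardCharsets.UTF_8.decode(lookupNameUtf8(id)).toString();
    }
}

// Hypothetical in-memory dictionary that keeps values as UTF-8 byte arrays.
final class ArrayUtf8Dictionary implements Utf8Dictionary {
    private final byte[][] values;

    ArrayUtf8Dictionary(String... names) {
        values = new byte[names.length][];
        for (int i = 0; i < names.length; i++) {
            values[i] = names[i].getBytes(StandardCharsets.UTF_8);
        }
    }

    @Override
    public ByteBuffer lookupNameUtf8(int id) {
        return ByteBuffer.wrap(values[id]).asReadOnlyBuffer();
    }
}

final class Utf8CountDistinct {
    // 64-bit FNV-1a over the remaining bytes; does not move the buffer position.
    static long fnv1a64(ByteBuffer buf) {
        long h = 0xcbf29ce484222325L;
        for (int i = buf.position(); i < buf.limit(); i++) {
            h ^= (buf.get(i) & 0xffL);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Byte path: hashes the stored UTF-8 directly, no String is created.
    static int countDistinctUtf8(Utf8Dictionary dict, int[] rowIds) {
        Set<Long> hashes = new HashSet<>();
        for (int id : rowIds) {
            hashes.add(fnv1a64(dict.lookupNameUtf8(id)));
        }
        return hashes.size();
    }

    // String path: decodes every value before using it.
    static int countDistinctDecoded(Utf8Dictionary dict, int[] rowIds) {
        Set<String> seen = new HashSet<>();
        for (int id : rowIds) {
            seen.add(dict.lookupName(id));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        Utf8Dictionary dict = new ArrayUtf8Dictionary("a", "b", "c");
        int[] rows = {0, 1, 1, 2, 0};
        System.out.println(countDistinctUtf8(dict, rows));
        System.out.println(countDistinctDecoded(dict, rows));
    }
}
```

Both methods give the same count here; the difference is that the byte path skips the per-row decode and String allocation.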


clintropolis

👍 this seems very useful.

Off the top of my head, 50a6839#diff-10d947f7e5fa97a2ece2629950472d210574627f9f728f2e934b8a8b001fa4bbR46 is at least one other place where this would help a lot. I'm sure there are a handful of other sketches that just convert strings from selectors back into bytes, too.

Another thing to consider for future work: it also seems to make sense to allow value matchers to match byte values directly (perhaps by making a DruidStringPredicate class instead of using Predicate<String>), so that filter implementations that want bytes could also be optimized.
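
A rough sketch of what such a byte-aware predicate could look like follows. It is only an illustration of the suggestion: `Utf8StringPredicate` and `PrefixPredicate` are hypothetical names, not existing Druid classes, and the byte comparison assumes both the prefix and the value are well-formed UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.function.Predicate;

// Hypothetical predicate that can be applied to Strings or to UTF-8 bytes.
interface Utf8StringPredicate extends Predicate<String> {
    // Fallback: decode and delegate, so existing predicates keep working.
    default boolean applyUtf8(ByteBuffer utf8) {
        return test(StandardCharsets.UTF_8.decode(utf8.duplicate()).toString());
    }
}

// Hypothetical prefix filter that overrides the byte path to avoid decoding.
final class PrefixPredicate implements Utf8StringPredicate {
    private final String prefix;
    private final byte[] prefixUtf8;

    PrefixPredicate(String prefix) {
        this.prefix = prefix;
        this.prefixUtf8 = prefix.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public boolean test(String value) {
        return value != null && value.startsWith(prefix);
    }

    @Override
    public boolean applyUtf8(ByteBuffer utf8) {
        if (utf8.remaining() < prefixUtf8.length) {
            return false;
        }
        // Absolute reads: the caller's buffer position is left untouched.
        int start = utf8.position();
        for (int i = 0; i < prefixUtf8.length; i++) {
            if (utf8.get(start + i) != prefixUtf8[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        PrefixPredicate p = new PrefixPredicate("dru");
        ByteBuffer value = ByteBuffer.wrap("druid".getBytes(StandardCharsets.UTF_8));
        System.out.println(p.applyUtf8(value));
        System.out.println(p.test("druid"));
    }
}
```

The default method keeps predicates that only know about Strings usable on the byte path, while implementations that can work on bytes override it.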

gianm · 2021-04-30T16:22:55Z

I was also thinking of using this to enable moving the groupBy dictionaries off-heap (we could copy strings from segments to merge buffers as UTF-8 and avoid string transcoding).


* Add "stringEncoding" parameter to DataSketches HLL.

  Builds on the concept from #11172 and adds a way to feed HLL sketches with UTF-8 bytes.

  This must be an option rather than always-on, because prior to this patch, HLL sketches used UTF-16LE encoding when hashing strings. To remain compatible with sketch images created prior to this patch -- which matters during rolling updates and when reading sketches that have been written to segments -- we must keep UTF-16LE as the default.

  Not currently documented, because I'm not yet sure how best to expose this functionality to users. I think the first place would be in the SQL layer: we could have it automatically select UTF-8 or UTF-16LE when building sketches at query time. We need to be careful about this, though, because UTF-8 isn't always faster. Sometimes, like for the results of expressions, UTF-16LE is faster. I expect we will sort this out in future patches.

* Fix benchmark.
* Fix style issues, improve test coverage.
* Put round back, to make IT updates easier.
* Fix test.
* Fix issue with filtered aggregators and add test.
* Use DS native update(ByteBuffer) method. Improve test coverage.
* Add another suppression.
* Fix ITAutoCompactionTest.
* Update benchmarks.
* Updates.
* Fix conflict.
* Adjustments.
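
The compatibility concern comes down to the fact that the same string yields different bytes under the two encodings, so a hash computed over those bytes differs as well. A small sketch using only the JDK shows this; `EncodingCompatDemo` is a made-up class name for illustration and is not part of the patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EncodingCompatDemo {
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static byte[] utf16le(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        String value = "druid";
        // prints [100, 114, 117, 105, 100]
        System.out.println(Arrays.toString(utf8(value)));
        // prints [100, 0, 114, 0, 117, 0, 105, 0, 100, 0]
        System.out.println(Arrays.toString(utf16le(value)));
    }
}
```

A sketch fed with one byte sequence will not treat the other as the same item, which is why sketches built under different encodings cannot be mixed and the previous encoding has to stay the default.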


Add a way to retrieve UTF-8 bytes directly via DimensionDictionarySelector.

aa51e90

gianm added the Area - Querying label Apr 27, 2021

gianm added 2 commits April 28, 2021 16:55

Add license header.

b351738

Updates suggested by robots.

3554659

clintropolis approved these changes Apr 30, 2021


gianm merged commit 046069f into apache:master Apr 30, 2021

gianm deleted the query-dictionary-utf8 branch April 30, 2021 17:56

This was referenced May 5, 2021

Add "stringEncoding" parameter to DataSketches HLL. #11201

Merged

Add ByteBuffer hashing methods to MurmurHash3, BaseHllSketch. apache/datasketches-java#353

Merged

clintropolis added this to the 0.22.0 milestone Aug 12, 2021

gianm merged 3 commits into apache:master from gianm:query-dictionary-utf8
