Add a way to retrieve UTF-8 bytes directly via DimensionDictionarySelector. #11172

Merged
gianm merged 3 commits into apache:master from gianm:query-dictionary-utf8
Apr 30, 2021

@gianm
Contributor

@gianm gianm commented Apr 27, 2021

The idea is that certain operations (like count distinct on strings) will
be faster if they are able to run directly on UTF-8 bytes instead of on
Java Strings decoded by "lookupName".

I'm looking into modifying the cardinality and DataSketches HLL build
aggregators to use lookupNameUtf8 instead of lookupName in a
follow-on patch, which is able to speed them up quite a bit.
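
To illustrate the idea, here is a minimal, self-contained sketch. It is not Druid's actual code: `ToyDictionary` is a hypothetical stand-in for a dictionary-encoded string column, and only the method names `lookupName` and `lookupNameUtf8` are taken from this PR. It shows a distinct count that works on the stored UTF-8 bytes directly, so no String is decoded per row.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public class Utf8LookupSketch {
  /** Hypothetical stand-in for a dictionary-encoded string column (not a Druid class). */
  static final class ToyDictionary {
    private final byte[][] utf8Values;

    ToyDictionary(String... values) {
      utf8Values = new byte[values.length][];
      for (int i = 0; i < values.length; i++) {
        utf8Values[i] = values[i].getBytes(StandardCharsets.UTF_8);
      }
    }

    /** Decodes the stored bytes into a new String on every call. */
    String lookupName(int id) {
      return new String(utf8Values[id], StandardCharsets.UTF_8);
    }

    /** Returns a read-only view of the stored bytes without decoding them. */
    ByteBuffer lookupNameUtf8(int id) {
      return ByteBuffer.wrap(utf8Values[id]).asReadOnlyBuffer();
    }
  }

  /** Counts distinct values among the given row ids by comparing raw UTF-8 bytes. */
  static int countDistinctUtf8(ToyDictionary dict, int[] rowIds) {
    // ByteBuffer.equals and hashCode are based on the remaining bytes,
    // so buffers with identical contents collapse into one set entry.
    Set<ByteBuffer> seen = new HashSet<>();
    for (int id : rowIds) {
      seen.add(dict.lookupNameUtf8(id));
    }
    return seen.size();
  }

  public static void main(String[] args) {
    ToyDictionary dict = new ToyDictionary("apple", "banana", "cherry");
    int[] rowIds = {0, 1, 1, 2, 0};
    System.out.println(countDistinctUtf8(dict, rowIds)); // prints 3
  }
}
```

A real aggregator would feed the bytes into a hash or sketch rather than a `HashSet`; the set is used here only to keep the example short.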

Member

@clintropolis clintropolis left a comment

👍 this seems very useful.

Off the top of my head, 50a6839#diff-10d947f7e5fa97a2ece2629950472d210574627f9f728f2e934b8a8b001fa4bbR46 is at least one other place where this would help a lot. I'm sure there are a handful of other sketches that just convert strings from selectors back into bytes, too.

Another thing to consider for future work: it also seems to make sense to let value matchers match byte values directly (perhaps with a DruidStringPredicate class instead of Predicate<String>), so that filter implementations that want bytes could also be optimized.
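
As a rough sketch of what byte-level matching could look like: the `Utf8Predicate` interface below is hypothetical and is not an existing Druid class or the proposed DruidStringPredicate. It only shows that an equality filter can encode its target once and then compare stored UTF-8 bytes without decoding each value.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class Utf8PredicateSketch {
  /** Hypothetical predicate over UTF-8 bytes; not an existing Druid interface. */
  interface Utf8Predicate {
    boolean test(ByteBuffer utf8Value);
  }

  /** Builds an equality matcher that encodes the target string once, up front. */
  static Utf8Predicate equalTo(String target) {
    final ByteBuffer expected = ByteBuffer.wrap(target.getBytes(StandardCharsets.UTF_8));
    // ByteBuffer.equals compares the remaining bytes of both buffers
    // and does not change either buffer's position.
    return utf8Value -> utf8Value != null && expected.equals(utf8Value);
  }

  public static void main(String[] args) {
    Utf8Predicate isFoo = equalTo("foo");
    ByteBuffer foo = ByteBuffer.wrap("foo".getBytes(StandardCharsets.UTF_8));
    ByteBuffer bar = ByteBuffer.wrap("bar".getBytes(StandardCharsets.UTF_8));
    System.out.println(isFoo.test(foo)); // prints true
    System.out.println(isFoo.test(bar)); // prints false
  }
}
```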

@gianm
Contributor Author

gianm commented Apr 30, 2021

I was also thinking of using this to enable moving the groupBy dictionaries off-heap (we could copy strings from segments to merge buffers as UTF-8 and avoid string transcoding).

@gianm gianm merged commit 046069f into apache:master Apr 30, 2021
@gianm gianm deleted the query-dictionary-utf8 branch April 30, 2021 17:56
gianm added a commit to gianm/druid that referenced this pull request May 5, 2021
Builds on the concept from apache#11172 and adds a way to feed HLL sketches
with UTF-8 bytes.

This must be an option rather than always-on, because prior to this
patch, HLL sketches used UTF-16LE encoding when hashing strings. To
remain compatible with sketch images created prior to this patch -- which
matters during rolling updates and when reading sketches that have been
written to segments -- we must keep UTF-16LE as the default.

Not currently documented, because I'm not yet sure how best to expose
this functionality to users. I think the first place would be in the SQL
layer: we could have it automatically select UTF-8 or UTF-16LE when
building sketches at query time. We need to be careful about this, though,
because UTF-8 isn't always faster. Sometimes, like for the results of
expressions, UTF-16LE is faster. I expect we will sort this out in
future patches.
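
The compatibility concern can be seen in a small standalone sketch. This is not the DataSketches or Druid code path; it only shows that the same string yields different byte sequences under UTF-8 and UTF-16LE. Any hash computed over those bytes will therefore generally differ, which is why sketches built under one encoding are not interchangeable with sketches built under the other.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingMismatchSketch {
  static byte[] utf8(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  static byte[] utf16le(String s) {
    // UTF_16LE has a fixed byte order and writes no byte-order mark.
    return s.getBytes(StandardCharsets.UTF_16LE);
  }

  public static void main(String[] args) {
    String value = "druid";
    byte[] a = utf8(value);
    byte[] b = utf16le(value);
    System.out.println(a.length);            // prints 5
    System.out.println(b.length);            // prints 10
    System.out.println(Arrays.equals(a, b)); // prints false
  }
}
```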
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
gianm added a commit that referenced this pull request Jun 30, 2023
* Add "stringEncoding" parameter to DataSketches HLL.

* Fix benchmark.

* Fix style issues, improve test coverage.

* Put round back, to make IT updates easier.

* Fix test.

* Fix issue with filtered aggregators and add test.

* Use DS native update(ByteBuffer) method. Improve test coverage.

* Add another suppression.

* Fix ITAutoCompactionTest.

* Update benchmarks.

* Updates.

* Fix conflict.

* Adjustments.
sergioferragut pushed a commit to sergioferragut/druid that referenced this pull request Jul 21, 2023
* Add "stringEncoding" parameter to DataSketches HLL.
