Add a way to retrieve UTF-8 bytes directly via DimensionDictionarySelector. #11172
gianm merged 3 commits into apache:master
…ector. The idea is that certain operations (like count distinct on strings) will be faster if they are able to run directly on UTF-8 bytes instead of on Java Strings decoded by "lookupName".
clintropolis left a comment
👍 this seems very useful.
Off the top of my head, 50a6839#diff-10d947f7e5fa97a2ece2629950472d210574627f9f728f2e934b8a8b001fa4bbR46 is at least one other place where this would help a lot. I'm sure there are a handful of other sketches that just convert strings from selectors back into bytes, too.
Another thing to consider for future work: it seems to make sense to allow value matchers to match byte values directly (perhaps by making a DruidStringPredicate class instead of using Predicate<String>), so that filter implementations that want bytes could also potentially be optimized.
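To make the byte-matching idea concrete, here is a minimal sketch in plain Java. It is not Druid code: the interface name `StringOrUtf8Predicate`, the method `testUtf8`, and the `equalTo` factory are hypothetical names invented for illustration. The assumption is that a filter could be handed the UTF-8 bytes of a value and compare them without decoding to a String first.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.function.Predicate;

class BytePredicateSketch {
    /**
     * Hypothetical predicate that can test either a decoded String or the
     * raw UTF-8 bytes of a value. Not an actual Druid interface.
     */
    interface StringOrUtf8Predicate extends Predicate<String> {
        boolean testUtf8(ByteBuffer utf8);
    }

    /** Builds an equality predicate; the target is encoded to UTF-8 once, up front. */
    static StringOrUtf8Predicate equalTo(final String target) {
        final ByteBuffer targetUtf8 =
            ByteBuffer.wrap(target.getBytes(StandardCharsets.UTF_8)).asReadOnlyBuffer();

        return new StringOrUtf8Predicate() {
            @Override
            public boolean test(String value) {
                return target.equals(value);
            }

            @Override
            public boolean testUtf8(ByteBuffer utf8) {
                // ByteBuffer.equals compares the remaining bytes and does not
                // move either buffer's position, so no String is created here.
                return targetUtf8.equals(utf8);
            }
        };
    }

    public static void main(String[] args) {
        StringOrUtf8Predicate p = equalTo("druid");
        ByteBuffer candidate = ByteBuffer.wrap("druid".getBytes(StandardCharsets.UTF_8));
        System.out.println(p.testUtf8(candidate));
    }
}
```

Byte equality only matches String equality when both sides are encoded the same way, which is why the target is encoded as UTF-8 here.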
I was also thinking of using this to enable moving the groupBy dictionaries off-heap (we could copy strings from segments to merge buffers as UTF-8 and avoid string transcoding).
Builds on the concept from apache#11172 and adds a way to feed HLL sketches with UTF-8 bytes.

This must be an option rather than always-on, because prior to this patch, HLL sketches used UTF-16LE encoding when hashing strings. To remain compatible with sketch images created prior to this patch -- which matters during rolling updates and when reading sketches that have been written to segments -- we must keep UTF-16LE as the default.

Not currently documented, because I'm not yet sure how best to expose this functionality to users. I think the first place would be in the SQL layer: we could have it automatically select UTF-8 or UTF-16LE when building sketches at query time. We need to be careful about this, though, because UTF-8 isn't always faster. Sometimes, like for the results of expressions, UTF-16LE is faster. I expect we will sort this out in future patches.
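The compatibility concern can be illustrated with plain Java: the same string yields different byte sequences under UTF-8 and UTF-16LE, so any hash computed over those bytes differs as well. This is only a sketch. The class name, the `encode` helper, and the FNV-1a hash are stand-ins chosen for illustration, and are not the hash or code path the sketch library actually uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StringEncodingDemo {
    /** Encodes a value as UTF-8 or UTF-16LE (hypothetical helper for this demo). */
    static byte[] encode(String value, boolean utf8) {
        return value.getBytes(utf8 ? StandardCharsets.UTF_8 : StandardCharsets.UTF_16LE);
    }

    /** Stand-in hash (64-bit FNV-1a); assumed here only to show that the input bytes matter. */
    static long fnv1a64(byte[] bytes) {
        long h = 0xcbf29ce484222325L;
        for (byte b : bytes) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] asUtf8 = encode("druid", true);
        byte[] asUtf16le = encode("druid", false);

        // Different byte sequences go into the hash depending on the encoding.
        System.out.println(Arrays.toString(asUtf8));
        System.out.println(Arrays.toString(asUtf16le));
        System.out.println(Long.toHexString(fnv1a64(asUtf8)));
        System.out.println(Long.toHexString(fnv1a64(asUtf16le)));
    }
}
```

If a sketch built from UTF-16LE input were merged with one built from UTF-8 input, the same string would be counted as two distinct values, which is why the encoding has to stay consistent with existing sketch images.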
* Add "stringEncoding" parameter to DataSketches HLL.
* Fix benchmark.
* Fix style issues, improve test coverage.
* Put round back, to make IT updates easier.
* Fix test.
* Fix issue with filtered aggregators and add test.
* Use DS native update(ByteBuffer) method. Improve test coverage.
* Add another suppression.
* Fix ITAutoCompactionTest.
* Update benchmarks.
* Updates.
* Fix conflict.
* Adjustments.
The idea is that certain operations (like count distinct on strings) will
be faster if they are able to run directly on UTF-8 bytes instead of on
Java Strings decoded by "lookupName".
I'm looking into modifying the cardinality and DataSketches HLL build
aggregators to use lookupNameUtf8 instead of lookupName in a
follow-on patch, which speeds them up quite a bit.
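As a rough illustration of the idea, the sketch below counts distinct values once through decoded Strings and once through UTF-8 bytes. `ToyDictionarySelector` and `fromValues` are hypothetical stand-ins written for this example and are not Druid's DimensionDictionarySelector. Only the method names lookupName and lookupNameUtf8 come from the description above, and the `ByteBuffer` return type is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

class Utf8LookupSketch {
    /** Hypothetical stand-in for a dictionary-backed selector; not Druid's interface. */
    interface ToyDictionarySelector {
        String lookupName(int id);          // decodes to a Java String
        ByteBuffer lookupNameUtf8(int id);  // exposes the stored UTF-8 bytes
    }

    /** Builds a toy selector whose dictionary is stored as UTF-8, as in a segment. */
    static ToyDictionarySelector fromValues(String... values) {
        final byte[][] utf8 = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            utf8[i] = values[i].getBytes(StandardCharsets.UTF_8);
        }
        return new ToyDictionarySelector() {
            @Override
            public String lookupName(int id) {
                // Allocates and decodes a new String on every call.
                return new String(utf8[id], StandardCharsets.UTF_8);
            }

            @Override
            public ByteBuffer lookupNameUtf8(int id) {
                // Wraps the existing bytes; no decoding and no byte copy.
                return ByteBuffer.wrap(utf8[id]).asReadOnlyBuffer();
            }
        };
    }

    static int countDistinctViaStrings(ToyDictionarySelector selector, int[] rowIds) {
        Set<String> seen = new HashSet<>();
        for (int id : rowIds) {
            seen.add(selector.lookupName(id));
        }
        return seen.size();
    }

    static int countDistinctViaUtf8(ToyDictionarySelector selector, int[] rowIds) {
        // ByteBuffer hashCode/equals are based on the remaining bytes,
        // so buffers with equal content collapse to one set entry.
        Set<ByteBuffer> seen = new HashSet<>();
        for (int id : rowIds) {
            seen.add(selector.lookupNameUtf8(id));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        ToyDictionarySelector selector = fromValues("a", "é", "druid");
        int[] rowIds = {0, 1, 1, 2, 0};
        System.out.println(countDistinctViaStrings(selector, rowIds));
        System.out.println(countDistinctViaUtf8(selector, rowIds));
    }
}
```

Both paths return the same count. The difference is that the UTF-8 path never builds a String, which is the cost the description above aims to avoid. An exact set is used here only to keep the example short; a real aggregator would feed the bytes to a sketch instead.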