Replace ByteBuffer with int[] for group-by key #3111
navis wants to merge 1 commit into apache:master
| ByteBuffer newKey = key.duplicate(); | ||
| newKey.putInt(dimValue); | ||
| unaggregatedBuffers = updateValues(newKey, dims.subList(1, dims.size())); | ||
| int[] newKey = Arrays.copyOf(key, key.length); |
This changes behavior. In the master implementation only one underlying copy of the data exists; here, you are increasing the number of heap objects.
We are already having heap problems on our highly utilized nodes. If it is possible to NOT create more heap objects during this workflow, that would be nice.
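To make the difference under discussion concrete, here is a minimal sketch of the two key-extension styles from the diff. The class name, method names, and the `depth` parameter are hypothetical and not taken from GroupByQueryEngine; the sketch only shows that `duplicate()` allocates a new buffer object that shares the backing bytes, while `Arrays.copyOf` allocates a new array holding its own copy of the data.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative only: names are hypothetical, not Druid code.
final class KeyCopyDemo {
    // ByteBuffer style: duplicate() creates a new buffer object with an
    // independent position/limit, but the backing bytes are shared with `key`.
    static ByteBuffer extendBuffer(ByteBuffer key, int dimValue) {
        ByteBuffer newKey = key.duplicate();
        newKey.putInt(dimValue); // advances newKey's position only
        return newKey;
    }

    // int[] style: copyOf() allocates a new array and copies every element.
    // `depth` (the slot to fill) is an assumption made for this sketch.
    static int[] extendArray(int[] key, int depth, int dimValue) {
        int[] newKey = Arrays.copyOf(key, key.length);
        newKey[depth] = dimValue;
        return newKey;
    }

    public static void main(String[] args) {
        ByteBuffer bufKey = ByteBuffer.allocate(8);
        ByteBuffer bufNew = extendBuffer(bufKey, 42);
        // The write is visible through the original buffer (shared bytes),
        // while the original's position is untouched.
        System.out.println("shared write visible: " + (bufKey.getInt(0) == 42));
        System.out.println("positions: " + bufKey.position() + " vs " + bufNew.position());

        int[] arrKey = new int[2];
        int[] arrNew = extendArray(arrKey, 0, 42);
        // The original array is unchanged because the data was copied.
        System.out.println("original untouched: " + (arrKey[0] == 0 && arrNew[0] == 42));
    }
}
```

Both styles allocate one object per recursion level; they differ in whether the key data itself is duplicated.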
I think in most cases key.duplicate() (1 ref, 5 int, 1 long, 1 boolean) uses more memory than a simple int[].
You'd have to have 7 (I think) dimensions before heap use of int[] exceeds ByteBuffer. Fewer than that and the int[] would actually be cheaper. Hard to say which is better without data, but my guess is that most real-world groupBy queries are going to be roughly at or below 7 dimensions.
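The back-of-the-envelope comparison in this thread can be sketched as follows. This is a rough estimate, not a measurement: the header sizes, reference size, and 8-byte alignment are assumptions (a 64-bit HotSpot JVM with compressed oops), and the field list is the one quoted above (1 ref, 5 int, 1 long, 1 boolean). The exact break-even point shifts with JVM version and flags, so a tool such as OpenJDK JOL would be needed for real numbers.

```java
// Rough per-allocation footprint estimate. All layout constants below are
// assumptions (64-bit HotSpot, compressed oops), not measured values.
final class KeyFootprint {
    static final int OBJECT_HEADER = 12; // assumed object header size
    static final int ARRAY_HEADER = 16;  // assumed: object header + 4-byte length
    static final int REF = 4;            // assumed compressed reference size

    // Round up to the next multiple of 8 (assumed object alignment).
    static long align8(long bytes) {
        return (bytes + 7) / 8 * 8;
    }

    // Buffer object created by duplicate(), using the field counts quoted
    // in the thread: 1 ref, 5 int, 1 long, 1 boolean.
    static long duplicatedBufferBytes() {
        return align8(OBJECT_HEADER + REF + 5 * 4 + 8 + 1);
    }

    // int[] key with one int per dimension.
    static long intArrayBytes(int dims) {
        return align8(ARRAY_HEADER + 4L * dims);
    }

    public static void main(String[] args) {
        System.out.println("duplicate() object: " + duplicatedBufferBytes() + " bytes");
        for (int dims = 1; dims <= 10; dims++) {
            System.out.println("int[" + dims + "]: " + intArrayBytes(dims) + " bytes");
        }
    }
}
```

The estimate covers only the object allocated at each recursion level; the backing storage that the duplicated buffer shares with its parent is not counted.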
btw, in v2, groupBy is reworked to avoid this recursive structure, which reduces allocations and stack depth.
|
I did a similar test with HyperLogLogCollector, which uses ByteBuffer. After replacing the ByteBuffer with five columns + byte[], I can see some changes in the numbers, especially for IncrementalIndex. For reference, this is after removing the HLL aggregator: |
|
@navis The HLL effect looks pretty big, that's cool! What change did you make exactly? |
|
This change looks good to me assuming we can get consensus on the comment chain in #3111 (comment). |
|
@gianm I attached the patch I used. It was just done for fun, not for a PR. |
|
I think ByteBuffer is the preferred way of using memory in Druid. If it could be backed by off-heap memory through some configuration, it's reasonable to keep it that way. But I saw some cases where it is never used with off-heap memory, and I got curious whether replacing it with a simpler format (int[] or byte[]) affects performance. I have no strong intention of getting this merged into master, but I wanted to show that using int[] or byte[] is reasonable in some cases. |
|
I think this approach would complicate numeric dimensions, since the key would no longer only contain ints |
|
@jon-wei fair enough. I'm closing. |
RowUpdater in GroupByQueryEngine uses ByteBuffer for the group-by key. Replacing it with int[] seemed faster and had a smaller footprint.
ByteBuffer
int array