Performance degradation for sketch queries with higher levels of broker merge parallelism

### Affected Version

0.17 and higher

### Description

For Druid clusters with parallel merge enabled on the brokers, query performance can degrade for sketch based queries. These queries can exhibit 30% - 100% degradation in performance, depending on the value set for `druid.processing.merge.pool.parallelism` and the number of sequences that needs to be merged and combined . With a smaller parallelism setting however, the performance is slightly better than running it single-threaded. 
Setting `druid.processing.merge.pool.parallelism` to 3 gave the best performance (~10% faster than the serial merge). However, increasing the parallelism beyond 3 is when we start to see degradation mentioned earlier.

I compared the flame graphs for a query on the broker toggling the merge parallelism and it looks like the parallel merge does a lot more combine operations on [Theta Union](https://datasketches.apache.org/api/java/snapshot/apidocs/org/apache/datasketches/theta/Union.html) objects compared to its serial counterpart. 

<img width="1660" alt="Screen Shot 2021-04-19 at 3 21 59 PM" src="https://user-images.githubusercontent.com/4603202/115302880-8827b700-a128-11eb-80be-e11e8075710d.png">

Combining two union objects involves [compacting one of them into a sketch before updating it to the second union](https://github.com/apache/druid/blob/fdc3c2f36281059c020df1d4e74eb940d4c922e8/extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/theta/SketchHolder.java#L149-L150). This process is relatively slower as Theta union is designed to optimize merge efficiency for large number of sketch merges. 

The first layer of the parallel merge process combines partitions of sequences and for sketch aggregators this generates Union objects  which are pushed into the intermediate blocking queues for the second layer.
The `MergeCombineAction` in the second layer runs on a single thread and this has to do a bunch of union merge operations based on the number of intermediate blocking queues it has to read from. Therefore, higher the parallelism, more the union merge operations that the `MergeCombineAction` has to perform and worse the performance. 

I don’t have a fix for this atm, but documenting this here for reference in case someone else runs into the same problem. If datasketches had a way to merge uncompacted unions, it could help alleviate this problem.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance degradation for sketch queries with higher levels of broker merge parallelism #11133

Affected Version

Description

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Performance degradation for sketch queries with higher levels of broker merge parallelism #11133

Description

Affected Version

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions