TDigest backed sketch aggregators #7331
@jihoonson, @gianm - would one of you have some bandwidth to review this PR?

Hi @samarthjain, I'm currently trying to finalize 0.14.0 release. I will take a look after the release.

@jihoonson - would you have bandwidth now to review this PR?

@samarthjain ah sorry. I forgot about this PR. Will take a look soon.
82be410 to 2050025
This needs to be org.apache.druid.extensions.contrib now
Ah! I initially built this against 0.12.2 since that is what we use at my day job. Fixed.
2050025 to 77f4deb
Hi @samarthjain, now I'm reviewing your PR. Will leave some comments soon.

jihoonson left a comment
@samarthjain sorry for the delayed review! Just left some comments. Also would you please resolve conflicts?
Maybe it should say "Casting to float type is not supported". Similar for getLong().
Please add @Nullable for compression.
COMRESSION -> COMPRESSION
How is it different from TDigestBuildSketchAggregator.DEFAULT_COMPRESSION?
Would you please raise an issue for this?
Have you had a chance to file a GitHub issue for this? We usually do so to track each issue correctly. I don't see any issues about this: https://github.com/apache/incubator-druid/issues/created_by/samarthjain.
BufferAggregator doesn't have to be synchronized because it's not used in incremental index.
@jihoonson - unfortunately the documentation on the base classes/interfaces doesn't clearly mention which methods could be called in a multi-threaded fashion. So I ended up following what the DataSketches implementation does.
For example: https://github.com/apache/incubator-druid/blob/master/extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/quantiles/DoublesSketchBuildBufferAggregator.java#L54
Yeah, it's lame that the docs don't say what should be synchronized. I think the DataSketches implementations are wrong. It doesn't have to be synchronized because concurrent reads and writes can happen only in the incremental index. You will see that other BufferAggregator implementations in druid-core or druid-extensions-core don't do it.
For clarity, when building an incremental index, are aggregators invoked? And is that a BufferAggregator or an Aggregator? From your comments it sounds like we needn't worry about thread safety for BufferAggregators, but what about Aggregators? Looking at HistogramAggregator or HistogramBufferAggregator, I don't see any kind of synchronization.
If a query is issued while a stream ingestion task is running, the query is routed to that task. This is when concurrent reads and writes can happen. Since only OnHeapIncrementalIndex is used at ingestion time, and it uses Aggregator, we need to consider whether there is any concurrency issue between get() and aggregate(). Check out these comments: #5002 (comment), #5148 (comment)
I'm not sure why HistogramAggregator is not synchronized even though it looks like it should be.
Thanks. I have made the change to synchronize access for get() and aggregate()
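For readers following the thread, here is a minimal sketch of the idea being discussed: guarding aggregate() and get() with the same lock so that a query thread never observes a half-updated sketch while ingestion is writing. The class below is a hypothetical stand-in and does not implement Druid's Aggregator interface; the list of doubles stands in for the real sketch state, and returning a copy from get() is an assumption made for this sketch, not something the PR is confirmed to do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an on-heap aggregator; NOT the Druid Aggregator interface.
public class SynchronizedSketchAggregator
{
  // Stand-in for mutable, non-thread-safe sketch state.
  private final List<Double> values = new ArrayList<>();

  // Called by the ingestion thread for each incoming row.
  public synchronized void aggregate(double value)
  {
    values.add(value);
  }

  // Called by query threads; takes the same lock as aggregate() and
  // returns a copy so the caller never sees later mutations.
  public synchronized List<Double> get()
  {
    return new ArrayList<>(values);
  }

  // Runs one writer and one reader concurrently and returns the final count.
  public static int runConcurrently(int writes)
  {
    SynchronizedSketchAggregator agg = new SynchronizedSketchAggregator();
    Thread writer = new Thread(() -> {
      for (int i = 0; i < writes; i++) {
        agg.aggregate(i);
      }
    });
    Thread reader = new Thread(() -> {
      for (int i = 0; i < 100; i++) {
        agg.get();
      }
    });
    writer.start();
    reader.start();
    try {
      writer.join();
      reader.join();
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return agg.get().size();
  }
}
```

Without the synchronized keyword, the reader could iterate the list while the writer resizes it. A buffer-backed aggregator that is never read during ingestion does not pay for this lock, which is the distinction made earlier in the thread.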
Same here. You don't have to synchronize these methods.
I missed one more comment. Would you please add a document for this extension?

Hi @samarthjain, thank you for updating the PR. It looks like the PR is somehow messed up. Would you check it again please?

@jihoonson - sorry about that. Looks like I pushed the garbage that IntelliJ generated. Fixed the commit by removing all the IntelliJ-related changes. I haven't written the docs for tdigest yet. Working on them now.

@samarthjain thanks for fixing it quickly! I'll take another look.
|property|description|required?|
|--------|-----------|---------|
|type|This String should always be "buildTDigestSketch"|yes|
Should be quantilesFromTDigestSketch
@Override
public synchronized Object get(final ByteBuffer buffer, final int position)
{
  return sketches.get(buffer).get(position);
The get() on the buffer aggregator needs to return a snapshot copy of the sketch to avoid use-after-free issues (see recently updated javadocs on get() and #7464)
Thank you for calling this out!
In this case we don't need to create a copy, since sketches is an IdentityHashMap where only the buffer's reference is used for equality (i.e. it isn't affected by changes in the actual object). I have added a comment.
Hmm yeah, it makes sense since MergingDigest is stored on heap.
Maybe we need to use an off-heap implementation in the future. Seems like there's no off-heap implementation currently.
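As an illustration of the IdentityHashMap point above, the following is a small self-contained sketch using only the JDK. It shows that an IdentityHashMap keys on the ByteBuffer reference, so two buffers with equal contents stay distinct and a lookup is unaffected by writes into the buffer. The class name, method names, and string values are hypothetical; the real aggregator stores sketches rather than strings.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical demo class; uses only JDK types.
public class IdentityKeyDemo
{
  // Returns {identity map size, equals-based map size} for two equal-content buffers.
  public static int[] entryCounts()
  {
    // Two distinct buffers with identical (all-zero) content, so a.equals(b) holds.
    ByteBuffer a = ByteBuffer.allocate(8);
    ByteBuffer b = ByteBuffer.allocate(8);

    Map<ByteBuffer, String> byIdentity = new IdentityHashMap<>();
    byIdentity.put(a, "sketch-a");
    byIdentity.put(b, "sketch-b");

    Map<ByteBuffer, String> byEquals = new HashMap<>();
    byEquals.put(a, "sketch-a");
    byEquals.put(b, "sketch-b"); // overwrites the entry for a, since a.equals(b)

    return new int[]{byIdentity.size(), byEquals.size()};
  }

  // Looks up a buffer after its content has changed.
  public static String lookupAfterMutation()
  {
    ByteBuffer a = ByteBuffer.allocate(8);
    Map<ByteBuffer, String> byIdentity = new IdentityHashMap<>();
    byIdentity.put(a, "sketch-a");

    // Absolute put: changes the content (and hence hashCode()/equals()) of the buffer.
    a.putLong(0, 42L);

    // The identity map compares references only, so the entry is still found.
    return byIdentity.get(a);
  }
}
```

Here entryCounts() yields two entries for the identity map and one for the regular HashMap, and lookupAfterMutation() still finds the original value after the buffer is written to.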
jihoonson left a comment
The latest change looks good to me. +1 after CI.
Thanks @samarthjain!
Thanks a lot for your reviews, @jihoonson. Would it be possible to merge this feature to the 0.15-incubating branch as well? It is a new feature, and since it is already reviewed and merged to master, I think it could make sense to include it in the next release as well. Further, this is a standalone module, so it won't affect the stability of the overall 0.15 release either.

Hi @samarthjain, our current release policy is time-based, and we branch every 3 months. Once the branch is created, we backport only release blockers such as regression bug fixes, security bug fixes, or wrong licenses. Since the 0.15.0 branch was created last Tuesday, this PR would be in 0.16.0 instead. But I understand your concern. People usually expect their work to be included in the next release, but currently it often isn't. I think slow review is one of the reasons causing this issue. We may need to try to make the whole review process faster.

@jihoonson we should be tagging the release milestone to every merged PR

@fjy just tagged. Thanks.
* First set of changes for tDigest histogram
* Add license
* Address code review comments
* Add a doc page for new T-Digest sketch aggregators. Minor code cleanup and comments.
* Remove synchronization from BufferAggregators. Address code review comments
* Fix typo