Kafka Index Task that supports Incremental handoffs #4178
pjain1 wants to merge 6 commits into apache:master from pjain1:kafka_indexing
Conversation
Unrelated failure.
@pjain1 I plan to read through this soon. Just from a really brief look though, KafkaIndexTask is 2200 lines now -- wow! It's probably worth looking at breaking this up into smaller files, maybe we could use a layer that sits between KafkaIndexTask and FiniteAppenderatorDriver. Or maybe incorporating the functionality into FiniteAppenderatorDriver, so it's no longer finite but can support handing off multiple times. What do you think? I might have more specific ideas after reading through…
Actually
That'd probably help for understanding the code. I'll have a more specific opinion after reading, but it does sound like a good idea to split that out.
@gianm Created a separate file for …
- Incrementally hand off segments when they hit the maxRowsPerSegment limit
- Decouple segment partitioning from Kafka partitioning; all records from consumed partitions go to a single Druid segment
- Decouple publishing of segments from waiting for handoff
- Support restoring the task on middle manager restarts by checkpointing end offsets for segments
Better to use SegmentIdentifier instead of String
Also, I think it would be better to move this to DriverHolder because it is the only class which is using this information.
driver.getMaxRowPerSegment() makes it look like each driver has a different maxRowsPerSegment, which is actually not the case. It would be better to use KafkaTuningConfig.getMaxRowsPerSegment() instead and remove driver.getMaxRowPerSegment().
@pjain1 @jihoonson when reading #4238 I see a lot of similarities in goals between the two patches. I'm wondering if some changes to FiniteAppenderatorDriver would make them both doable a little more simply; see this thread: #4238 (comment). Could that reduce some of the complexity from this patch, maybe making DriverHolders and the extra tracking in KafkaIndexTask unnecessary, since a single AppenderatorDriver could handle multiple handoffs?
@jihoonson I didn't get a chance to look at your comments; I will get back to them after taking a look at #4238
@gianm I read your comment about the design change. In the new design I am not sure how a single AppenderatorDriver can support multiple active segments for the same interval. For example, when the maxRowsInSegmentLimit is reached, a checkpoint (a set of partition offsets) is decided on by the replica tasks. Now when each replica fetches records from Kafka, then depending on the offset it decides which driver the record should go to: the driver before the checkpoint or the one after it. I am not sure if my explanation is clear enough; if not, I will try to explain it more.
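To make that routing concrete, here is a minimal, hypothetical sketch (the names and structure are illustrative, not from the actual patch): once the checkpoint offsets are agreed on, a record goes to the pre-checkpoint segments if its offset is at or below the checkpoint for its partition, otherwise to the post-checkpoint segments.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of routing a record to the segments before or after
// a checkpoint; this is not the actual task code.
public class CheckpointRouting {
  enum Target { BEFORE_CHECKPOINT, AFTER_CHECKPOINT }

  // checkpoint: for each Kafka partition, the last offset belonging to the
  // segments currently being handed off (treated as inclusive in this sketch).
  static Target route(Map<Integer, Long> checkpoint, int partition, long offset) {
    return offset <= checkpoint.get(partition) ? Target.BEFORE_CHECKPOINT : Target.AFTER_CHECKPOINT;
  }

  public static void main(String[] args) {
    Map<Integer, Long> checkpoint = new HashMap<>();
    checkpoint.put(0, 100L); // partition 0 checkpointed at offset 100
    checkpoint.put(1, 250L); // partition 1 checkpointed at offset 250

    // A replica that had only read to offset 90 on partition 0 still routes
    // offsets 91..100 to the pre-checkpoint segments once it resumes reading.
    System.out.println(route(checkpoint, 0, 95));  // BEFORE_CHECKPOINT
    System.out.println(route(checkpoint, 0, 101)); // AFTER_CHECKPOINT
    System.out.println(route(checkpoint, 1, 300)); // AFTER_CHECKPOINT
  }
}
```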
Hmm. I was thinking that the replicas would sort of chill out for a bit when it's checkpoint time, and nobody would read past the checkpoint until they agree on what it is. Or, instead of that, would it work to do something like:
In this world each checkpoint block would have its own sequenceName but they would all be managed by the same AppenderatorDriver. And AppenderatorDriver, I suppose, would want to track metadata separately for each pending publish. Would something like that allow us to avoid the need for a checkpoint table and DriverHolders?
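As a rough illustration of what that bookkeeping might look like (a hypothetical sketch, not the real AppenderatorDriver API), the driver would keep active segments and pending publish metadata keyed by sequenceName, so each checkpoint block can be published independently:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of per-sequence tracking inside a single driver; none of
// these names come from the actual Druid AppenderatorDriver.
public class SequenceTracking {
  // sequenceName -> identifiers of segments still open for that checkpoint block
  private final Map<String, List<String>> activeSegments = new LinkedHashMap<>();
  // sequenceName -> Kafka offsets to commit when that block is published
  private final Map<String, Map<Integer, Long>> pendingPublishMetadata = new LinkedHashMap<>();

  public void startSequence(String sequenceName) {
    activeSegments.put(sequenceName, new ArrayList<String>());
  }

  public void addSegment(String sequenceName, String segmentId) {
    activeSegments.get(sequenceName).add(segmentId);
  }

  // Called once the replicas agree on a checkpoint: the block's segments are
  // frozen and queued for publish together with the agreed end offsets.
  public List<String> beginPublish(String sequenceName, Map<Integer, Long> endOffsets) {
    pendingPublishMetadata.put(sequenceName, endOffsets);
    return activeSegments.remove(sequenceName);
  }
}
```

Each new checkpoint block would then just call startSequence with a fresh name, for example the task's baseSequenceName plus an increasing suffix.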
Sure, replicas pause when it's checkpoint time, however both replicas may have read up to different offsets in different partitions. So after the checkpoint is decided, when they resume reading, some records for some partitions may fall into the next set of segments (after the checkpoint) and some into the current set of segments (before the checkpoint).

I thought of the approach that you mentioned of having a different sequenceName for the set of segments corresponding to a checkpoint, so there will be a map of sequenceName -> activeSegments. In this case, as you said, AppenderatorDriver will have to maintain the state of the map of sequenceName -> activeSegments and the state of pending publishes, which is essentially what DriverHolder is doing now. However, a kind of checkpoint table will still be needed, because when a replica dies and a new task is started in its place, it will need to know the checkpoints that were already decided on. BTW, the task checkpoint can be cleaned up automatically by making a simple change to the segment publish so that the checkpoint information is also deleted as soon as the corresponding driver has finished publishing its segments.

Having said all this, I am totally OK with having a single AppenderatorDriver and getting rid of DriverHolder and the task checkpoint table if we can make it work.
Ah, that makes sense. It seems possible to me that making AppenderatorDriver the thing that does that would be simpler overall, so only one layer (namely: AppenderatorDriver) needs to worry about how active segments are tracked. Do you think it'd be better or worse to teach AppenderatorDriver how to do that sort of stuff?
Could the supervisor start off the replacement task from the most recent metadata in the druid_datasources table we already have? It's acting like a checkpoint already, just there's only one of them. I think that should work if we assume that there will be at most one publish happening at a time. This is already assumed by the current system (the supervisor does start new tasks before old ones exit, but only one can be publishing or else there will be txn failures). For the sake of keeping things simple I think it's OK to keep assuming that.
I hope we can do something like that in order to keep things simple, but if we end up not being able to, that's life. I'd at least try to work it all out though. Thanks for bearing with me.
There is a corner case here - Suppose two replicas decided on a checkpoint and then one of the replicas dies. The other replica has not consumed up to the checkpoint and has not published the segments corresponding to the checkpoint, so the dataSourceMetadata does not yet reflect the checkpoint. The Supervisor starts a replacement task that will miss this checkpoint, and at some point when it reaches its maxRowsInSegmentLimit it will send a CheckPoint request and wrong things will happen.
No worries, I am fully on board with making things simple.
What sort of wrong things would happen? I think in the case you're talking about, the supervisor would get a checkpoint request that is "behind" the currently published datasource metadata. Would it help if the supervisor then told the task something like, "hey, you're really behind, you should toss out what you have and start over from here: <insert current metadata>"
@jihoonson For the Kafka indexing task, the publish call should never block, and from the code it seems like it can block. I would suggest having an unbounded queue to which publish tasks can be submitted and …
@jihoonson Or IMO a better option would be to run the …

@gianm This and the previous comment are the reasons why we thought it might be better for tasks to decide how to call and handle publish, and not to use an executor in the AppenderatorDriver. What do you think?
@pjain1 ah right. I missed changing it to be unbounded. I still find it difficult to understand what the benefits are when …
@jihoonson …
@pjain1 ah right. It should be some reasonably large value k >> 0 which doesn't exceed the VM limit. I think it should be configurable. Does that make sense?
By k you mean the queue capacity? If k is set to a large number then why not just use … I guess I am OK with both approaches, having it configurable or using …
Ah right. I didn't notice the …
I think it is ok for it to return ListenableFuture. Thanks.
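For reference, a minimal sketch of the non-blocking publish pattern being discussed (the class and publishSegments body are placeholders, not the actual task code): Executors.newSingleThreadExecutor() is backed by an unbounded queue, and Guava's listeningDecorator makes submit return a ListenableFuture, so callers never block on publish.

```java
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.ListeningExecutorService;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;

// Sketch only: publish calls queue up on the executor's unbounded queue and
// run one at a time, while the caller immediately gets a ListenableFuture.
public class NonBlockingPublish {
  private final ListeningExecutorService publishExecutor =
      MoreExecutors.listeningDecorator(Executors.newSingleThreadExecutor());

  public ListenableFuture<Boolean> publishAsync(final String sequenceName) {
    final Callable<Boolean> publishCall = () -> publishSegments(sequenceName);
    // submit() enqueues and returns immediately; it never blocks the caller.
    return publishExecutor.submit(publishCall);
  }

  private boolean publishSegments(String sequenceName) {
    // placeholder for pushing segments to deep storage and committing metadata
    return true;
  }
}
```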
Thanks. I think it will take about 2 days. I'll ping you when I raise the PR.
Hi @pjain1, I was wondering, are you still working on this? It would still be useful!
And do you think we could/should mark this for 0.11?
@gianm Yes, I am still working on it. I am done with the code and am working on testing it. I think it can go in 0.11.
Thank you @pjain1. I marked the milestone as 0.11.0.
Resumed in #4815
Closing in favor of #4815 |
Fixes #4016 and #4177
Design -
- A DriverHolder has start and end offsets and thus represents the Kafka records that the FiniteAppenderatorDriver it wraps should consume. Each DriverHolder has a unique sequence name, which is a concatenation of the baseSequenceName of the task and a suffix of the linearly increasing DriverHolder index. Similarly, the persist directory for each holder is a concatenation of the basePersistDir of the task and the holder index.
- A DriverHolder is created with the start and end offsets of the task.
- Each record read from Kafka is added to the DriverHolder that can handle it.
- When the maxRowsInSegment limit is hit, the task pauses and the Supervisor is sent a CheckPointDataSourceMetadataAction, which pauses all replica tasks, gets the highest offset for all partitions, and stores these partition offsets as a checkpoint in the metadata store for the baseSequenceName of the task. It then sends the checkpoint to all replica tasks using a setEndOffset call with the finish query param as false, indicating that these are not the final end offsets for the task. The latest DriverHolder in the task updates its local end offsets to the checkpoint, and a new DriverHolder is created to handle records beyond this checkpoint.
- When a DriverHolder has consumed up to its end offsets, it is submitted to a persistExecService, which persists the driver and then adds it to a publishQueue. Another executor service, named publishExecService, takes the driver from the publishQueue and tries to push and publish the segments to deep storage and the metadata store. After a successful publish, the driver is added to a handOffQueue, and handOffExecService waits for the hand off of the driver to complete. (A simplified sketch of this pipeline follows the list.)
- Finally, a SentinelDriverHolder is submitted to be persisted and published. When this SentinelDriverHolder is done, the task ends.
- The DriverHolder list is persisted every time a critical action happens, like: a new holder is created, end offsets for a holder are set, or a holder is removed from the list.
- On restore, the task recreates the DriverHolders if necessary.

This code has been running on our metrics cluster for a few days without any issues.
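To make the pipeline concrete, here is a simplified, hypothetical sketch of the persist/publish/hand-off flow (only the queue and executor names mirror the description above; the Holder type and the work done in each stage are placeholders):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified sketch of the persist -> publish -> hand-off pipeline. Each stage
// has its own single-threaded executor; a holder flows through the two queues.
public class HandoffPipeline {
  static class Holder { final String sequenceName; Holder(String name) { this.sequenceName = name; } }

  private final BlockingQueue<Holder> publishQueue = new LinkedBlockingQueue<>();
  private final BlockingQueue<Holder> handOffQueue = new LinkedBlockingQueue<>();
  private final ExecutorService persistExecService = Executors.newSingleThreadExecutor();
  private final ExecutorService publishExecService = Executors.newSingleThreadExecutor();
  private final ExecutorService handOffExecService = Executors.newSingleThreadExecutor();

  // Called when a holder has consumed up to its end offsets.
  public void onHolderFinished(final Holder holder) {
    persistExecService.submit(() -> {
      // 1. persist the holder's in-memory rows to disk, then hand it to the publish stage
      publishQueue.add(holder);
    });
  }

  public void start() {
    publishExecService.submit(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        Holder holder = publishQueue.take();
        // 2. push the holder's segments to deep storage and commit them to the
        //    metadata store, then queue the holder for hand-off
        handOffQueue.add(holder);
      }
      return null;
    });
    handOffExecService.submit(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        Holder handedOff = handOffQueue.take();
        // 3. wait until historicals have loaded handedOff's published segments
      }
      return null;
    });
  }
}
```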
Note -
We do not run replicas on our metrics cluster. Now running with 2 replicas since the last few days.

TODO
- Possibly test with replicas on real cluster