KAFKA-12226: Prevent source task offset failure when producer is overwhelmed #10112
C0urante wants to merge 4 commits into apache:trunk
Conversation
---
@gharris1727 @tombentley @chia7712 anyone got a moment? 😃
gharris1727 left a comment:
I think recordFlushPending is a much better name than flushing, these seem like good changes to me.
It's a bit tough to parse the tests, but that seems par for the course in that file. The two tests you added seem to be testing very similar variants of the same scenario, though, and it's very hard to pick out the difference between them. Is there a way to make the functional differences between the variants more clear to the next person to read these tests?
---
Yeah, that's fair. I found an existing pattern which seems applicable here. I'll refactor the tests to use a similar approach. Thanks Greg!
gharris1727 left a comment:
Thanks @C0urante, LGTM!
---
The only consistent test failure for the last run appears unrelated to the changes here. See https://github.com/apache/kafka/pull/10140/checks?check_run_id=1917700907 and https://github.com/apache/kafka/pull/10077/checks?check_run_id=1877979511 for other instances of the same failure: `org.apache.kafka.connect.integration.InternalTopicsIntegrationTest.testCreateInternalTopicsWithDefaultSettings`
mimaison left a comment:
Thanks for the PR.
I wonder if the test could be simplified a bit as it currently looks pretty scary! This is in part caused by EasyMock.
Restarting the task is nice to ensure we recover correctly, but could we also check that with send() blocking, we're still able to make progress with the offsets?
---
Yeah, the unit tests for the worker classes in general can be a little gnarly. The boilerplate segments (for things like setting up the converter, transformation chain, task context, headers, topic tracking, offset buffering, status tracking, and performing metrics assertions) are in line with the other tests in this class, though. The only difference with the new unit test here is that we want to test task behavior under some pretty fine-grained circumstances, which is accomplished right now by setting up latches and awaiting them to ensure that the task (which is running on a separate thread) has reached certain points in its lifecycle and not gone any further. If you have suggestions for how to improve that, I'm all ears!

I'm not sure what you're referring to with task restart; that's not tested for at the moment, and any tests for that would likely be dependent on the success (or lack thereof) of any prior offset commit attempts, which are already tested for. Can you clarify or provide a brief example?
---
I did not look at the tests closely yet, but I hope to take another look some time next week.
---
Ah, gotcha! Yeah. This is because right now, when an offset commit attempt fails because the current batch of source records wasn't flushed to Kafka in time, all backlogged records that were read from the task while the flush was in progress get merged into the next batch of outstanding records.

At this point, the current behavior is that every failed commit attempt yields an even larger batch to flush on the next attempt, so the batch can grow without bound. If the task is generating a steady throughput of 10000 records per offset commit attempt, and the worker's producer is only able to write 5000 of those before the offset commit attempt times out, the worker will never be able to successfully commit offsets for the task, even though there are plenty of records that have been sent to and ack'd by the broker.

The proposed behavior in the PR is to retry the same batch instead of starting a new, larger one, so that offsets can eventually be committed once every record in that batch has been ack'd.
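To make the arithmetic concrete, here's a hypothetical back-of-the-envelope simulation of the two strategies, using the 10000-records-produced / 5000-records-ack'd-per-commit-interval figures from the example above (the class and method names are illustrative, not Connect APIs):

```java
// Hypothetical simulation of the two behaviors described above.
public class BatchGrowthSim {
    // Old behavior: a failed flush merges the backlog into the next, larger batch.
    static int batchAfterIntervalsOldBehavior(int intervals) {
        int batch = 0;
        for (int i = 0; i < intervals; i++) {
            batch += 10000;                 // records polled from the task this interval
            batch -= Math.min(batch, 5000); // records the producer manages to ack
            // the flush fails (batch not empty), and the whole remainder is
            // carried into the next, ever-larger batch
        }
        return batch;
    }

    // Proposed behavior: keep retrying the same batch; new records wait in the
    // backlog, so the in-flight batch can only shrink.
    static int inFlightBatchProposed(int initialBatch, int intervals) {
        int batch = initialBatch;
        for (int i = 0; i < intervals; i++) {
            batch -= Math.min(batch, 5000); // producer keeps draining the same batch
        }
        return batch;
    }
}
```

Under the old behavior the unflushed batch grows by 5000 records per interval and never drains; under the proposed behavior the in-flight batch is fixed, so it empties after two intervals and the commit finally succeeds.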
---
```java
private boolean flushing;
private boolean recordFlushPending;
private boolean offsetFlushPending;
private CountDownLatch stopRequestedLatch;
```
nit: while we're at it, this could be final
```java
boolean flushStarted = offsetWriter.beginFlush();
// No need to begin a new offset flush if we timed out waiting for records to be flushed to
// Kafka in a prior attempt.
if (!recordFlushPending) {
```
If I understand it correctly, the main difference in this patch is that we no longer fail the flush if the messages cannot be drained quickly enough from `outstandingMessages`. A few questions come to mind:
- Is the flush timeout still a useful configuration? Was it ever? Even if we time out, we still have to wait for the records that were sent to the producer.
- While we are waiting for `outstandingMessages` to be drained, we are still accumulating messages in `outstandingMessagesBacklog`. I imagine we can get into a pattern here once we fill up the accumulator. While we're waiting for `outstandingMessages` to complete, we fill `outstandingMessagesBacklog`. Once the flush completes, `outstandingMessagesBacklog` becomes `outstandingMessages` and we are stuck waiting again. Could this prevent us from satisfying the commit interval?
Overall, I can't shake the feeling that this logic is more complicated than necessary. Why do we need the concept of flushing at all? It would be more intuitive to just commit whatever the latest offsets are. Note that we do not use `outstandingMessages` for the purpose of retries. Once a request has been handed off to the producer successfully, we rely on the producer to handle retries. Any delivery failure after that is treated as fatal. So then does `outstandingMessages` serve any other purpose other than tracking flushing? I am probably missing something here. It has been a long time since I reviewed this logic.
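For readers following along, here's a simplified sketch of the two-collection bookkeeping that the flushing flag coordinates. This is my own illustration with field names mirroring the ones under discussion, not the actual WorkerSourceTask code (in particular, it omits the removal of records as they are ack'd):

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Simplified sketch of the two-collection bookkeeping described above.
public class OutstandingSketch {
    private Map<Object, Object> outstandingMessages = new IdentityHashMap<>();
    private Map<Object, Object> outstandingMessagesBacklog = new IdentityHashMap<>();
    private boolean flushing;

    synchronized void recordSent(Object record) {
        // While a flush is in progress, new records go to the backlog so the
        // in-flight batch stays fixed.
        if (flushing) {
            outstandingMessagesBacklog.put(record, record);
        } else {
            outstandingMessages.put(record, record);
        }
    }

    synchronized void beginFlush() {
        flushing = true;
    }

    synchronized void finishSuccessfulFlush() {
        // The backlog becomes the next in-flight batch.
        Map<Object, Object> temp = outstandingMessages;
        outstandingMessages = outstandingMessagesBacklog;
        outstandingMessagesBacklog = temp;
        outstandingMessagesBacklog.clear();
        flushing = false;
    }

    synchronized int outstandingCount() {
        return outstandingMessages.size();
    }
}
```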
- I think it's a necessary evil, since source task offset commits are conducted on a single thread. Without a timeout for offset commits, a single task could block indefinitely and disable offset commits for all other tasks on the cluster.
- This is definitely possible; I think the only saving grace here is that the combined size of the `outstandingMessages` and `outstandingMessagesBacklog` fields is going to be naturally throttled by the producer's buffer. If too many records are accumulated, the call to `Producer::send` will block synchronously until space is freed up, at which point the worker can continue polling the task for new records. This isn't ideal as it will essentially cause the producer's entire buffer to be occupied until the throughput of record production from the task decreases and/or the write throughput of the producer rises to meet it, but it at least establishes an upper bound for how large a single batch of records in the `outstandingMessages` field ever gets. It may take several offset commit attempts for all of the records in that batch to be ack'd, with all but the last (successful) attempt timing out and failing, but forward progress with offset commits should still be possible.
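That back-pressure argument can be illustrated with a toy model (my own, not Kafka client code) where a semaphore stands in for the producer's bounded buffer: once the permits run out, sends stall until an ack frees space, which is exactly what caps the batch size:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Toy stand-in for the producer's bounded buffer: permits model free buffer
// space. Illustrates the back-pressure argument; not the real producer.
public class BackPressureSketch {
    private final Semaphore bufferSpace;

    BackPressureSketch(int slots) {
        this.bufferSpace = new Semaphore(slots);
    }

    // Analogous to Producer::send: returns false if the "buffer" stayed full
    // for the whole timeout (the real producer blocks, then throws).
    boolean trySend(long timeoutMs) throws InterruptedException {
        return bufferSpace.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
    }

    // Analogous to a delivery ack freeing buffer space.
    void ack() {
        bufferSpace.release();
    }
}
```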
I share your feelings about the complexity here. I think ultimately it arises from two constraints:
- A worker-global producer is used to write source offsets to the internal offsets topic right now. Although this doesn't necessarily require the single-threaded logic for offset commits mentioned above, things become simpler with it.
- (Please correct me if I'm wrong on this point; my core knowledge is a little fuzzy and maybe there are stronger guarantees than I'm aware of) Out-of-order acknowledgment of records makes tracking the latest offset for a given source partition a little less trivial than it seems initially. For example, if a task produces two records with the same source partition that end up being delivered to different topic-partitions, the second record may be ack'd before the first, and when it comes time for offset commit, the framework would have to refrain from committing offsets for that second record until the first is also ack'd.
I don't think either of these points makes it impossible to add even finer-grained offset commit behavior and/or remove offset commit timeouts, but the work involved would be a fair amount heavier than this relatively minor patch. If you'd prefer to see something along those lines, could we consider merging this patch for the moment and performing a more serious overhaul of the source task offset commit logic as a follow-up, possibly with a small design discussion on a Jira ticket to make sure there's alignment on the new behavior?
> (Please correct me if I'm wrong on this point; my core knowledge is a little fuzzy and maybe there are stronger guarantees than I'm aware of) Out-of-order acknowledgment of records makes tracking the latest offset for a given source partition a little less trivial than it seems initially. For example, if a task produces two records with the same source partition that end up being delivered to different topic-partitions, the second record may be ack'd before the first, and when it comes time for offset commit, the framework would have to refrain from committing offsets for that second record until the first is also ack'd.
Ok, that rings a bell. I think I see how the logic works now and I don't see an obvious way to make it simpler. Doing something finer-grained as you said might be the way to go. Anyway, I agree this is something to save for a follow-up improvement.
> I think it's a necessary evil, since source task offset commits are conducted on a single thread. Without a timeout for offset commits, a single task could block indefinitely and disable offset commits for all other tasks on the cluster.
Hmm.. This is suspicious. Why do we need to block the executor while we wait for the flush? Would it be simpler to let the worker source task finish the flush and the offset commit in its own event thread? We end up blocking the event thread anyway because of the need to do it under the lock.
> We end up blocking the event thread anyway because of the need to do it under the lock.
I think we actually keep polling the task for records during the offset commit, which is the entire reason we have the `outstandingMessagesBacklog` field. Without it, we'd just add everything to `outstandingMessages`, knowing that, if we've made it to the point of adding a record to that collection, we're not in the process of committing offsets, right?
Concretely, we can see that the offset thread relinquishes the lock on the WorkerSourceTask instance while waiting for outstanding messages to be ack'd.
I'm not sure we need to perform offset commits on a separate thread, but it is in line with what we do for sink tasks, where we leverage the Consumer::commitAsync method.
If we want to consider making offset commit synchronous (which is likely going to happen anyways when transactional writes for exactly-once source are introduced), that also might be worth a follow-up. The biggest problem I can think of with that approach would be that a single offline topic-partition would block up the entire task thread when it comes time for offset commit. If we keep the timeout for offset commit, then that'd limit the fallout and allow us to resume polling new records from the task and dispatching them to the producer after the commit attempt timed out. However, there'd still be a non-negligible throughput hit (especially for workers configured with higher offset timeouts).
It's mostly the flushing that concerns me, not really the offset commit. I don't think we need to make it synchronous, just that it seems silly to block that shared scheduler to complete it. My thought instead was to let the scheduler trigger the flush, but then let the task be responsible for waiting for its completion. While waiting, of course, it can continue writing to `outstandingMessagesBacklog`. So I don't think there should be any issue from a throughput perspective.
I've been ruminating over this for a few days and I think it should be possible to make task offset commits independent of each other by changing the source task offset commit scheduler to use a multi-threaded executor instead of a global single-threaded executor for all tasks. This isn't quite the same thing as what you're proposing since tasks would still not be responsible for waiting for flush completion (the offset scheduler's threads would be), but it's a smaller change and as far as I can tell, the potential downsides only really amount to a few extra threads being created.
The usage of `scheduleWithFixedDelay` already ensures that two offset commits for the same task won't be active at the same time, as it "Creates and executes a periodic action that becomes enabled first after the given initial delay, and subsequently with the given delay between the termination of one execution and the commencement of the next."
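As a quick sanity check of that guarantee (hypothetical demo code, not from the PR), even on a multi-threaded scheduled executor, `scheduleWithFixedDelay` never overlaps executions of the same task:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Demonstrates that scheduleWithFixedDelay executions of one task never
// overlap, even with multiple worker threads available.
public class FixedDelayDemo {
    public static int maxObservedConcurrency() throws Exception {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(4);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger maxActive = new AtomicInteger();
        pool.scheduleWithFixedDelay(() -> {
            int now = active.incrementAndGet();
            maxActive.accumulateAndGet(now, Math::max);
            try {
                Thread.sleep(20); // simulate a slow offset commit
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            active.decrementAndGet();
        }, 0, 1, TimeUnit.MILLISECONDS);
        Thread.sleep(200);
        pool.shutdownNow();
        pool.awaitTermination(1, TimeUnit.SECONDS);
        return maxActive.get();
    }
}
```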
Beyond that, the only concern that comes to mind is potential races caused by concurrent access of the offset backing store and its underlying resources.
In distributed mode, the `KafkaOffsetBackingStore` and its usage of the underlying `KafkaBasedLog` appear to be thread-safe, as everything basically boils down to calls to `Producer::send`, which should be fine.
In standalone mode, the `MemoryOffsetBackingStore` handles all writes/reads of the local offsets file via a single-threaded executor, so concurrent calls to `MemoryOffsetBackingStore::set` should also be fine.
Granted, none of this addresses your original concern, which is whether an offset commit timeout is necessary at all. In response to that, I think we may also want to revisit the offset commit logic and possibly do away with a timeout altogether. In sink tasks, for example, offset commit timeouts are almost a cosmetic feature at this point and are really only useful for metrics tracking.

However, at the moment it's actually been pretty useful to us to monitor source task offset commit success/failure JMX metrics as a means of tracking overall task health. We might be able to make up the difference by relying on metrics for the number of active records, but it's probably not safe to make that assumption for all users, especially for what is intended to be a bug fix. So, if possible, I'd like to leave a lot of the offset commit logic intact as it is for the moment and try to keep the changes here minimal.
To summarize: I'd like to proceed by keeping the currently-proposed changes, and changing the source task offset committer to use a multi-threaded executor instead of a single-threaded executor. I can file a follow-up ticket to track improvements in offset commit logic (definitely for source tasks, and possibly for sinks) and we can look into that if it becomes a problem in the future. What do you think?
I had another look at this PR. Here is my understanding:
In the old logic, if we could not flush pending records before the expiration of `offset.flush.timeout.ms`, then we would give up and try again later. The problem with this is that we may reach a point where the producer has built a big enough backlog of outstanding messages that they cannot be flushed before the expiration of `offset.flush.timeout.ms`. Basically we are taking records from the connector faster than they can be flushed. And if we ever reach this state, the connector is dead in the water because we are not able to commit offsets any more. So its progress appears to stall even though the data is still being copied.
The patch gets around the problem by relaxing `offset.flush.timeout.ms` a little bit. Rather than treating expiration of the timeout as a fatal error, we continue to allow more time for `outstandingMessages` to be drained. This ensures that we do not have to wait for the messages from `outstandingMessagesBacklog`, which are added while the flush is in progress.
Assuming my understanding is right, the only concern I have with this patch is the following. Ultimately this issue comes down to a slow producer which is not keeping up with the connector. When we begin a flush, we have to drain all of the outstanding data before we can commit offsets. For a slow producer, this could take a very long time (even weeks given a `delivery.timeout.ms` of `Int.MaxValue`). We are eventually able to make progress, but users may still see progress indefinitely stalled. A good fix here I think would either prevent the backlog from reaching this point in the first place, or it would make offset commits more of an asynchronous process which does not depend on flushing all pending data. Intuitively, you expect the worker to commit whatever its progress is regularly without respect to the speed of the producer. A slow worker still goes slow, but at least users can track its progress. My understanding of the semantics here is a bit limited, so I do not know if this is possible.
The issues we have seen related to this issue came about from one slow broker or partition. This is a bad scenario for the producer because the pending data for the slow broker can exhaust the whole buffer. This effectively slows down every other partition since we constantly have to wait for room to free up in the buffer. I think it would be interesting to consider improvements to the partitioning logic in the producer to take into account the size of the pending data. The producer could then compensate for a slow broker by writing less data to it. On the other hand, this would cause data imbalances which might have downstream effects, so it might not be a clear-cut win.
Anyway, I am not very familiar with this logic, so I am hoping for additional reviews from @mimaison , @kkonstantine , and @rhauch to push this patch through. If you folks are happy with it, please do not wait for me.
---
First of all, thanks for trying to fix this issue, @C0urante. And thanks for your insight, @hachikuji. I agree that it seems like we should not have to block the offset commits until the full batch of records has been written to Kafka.

I suspect the current logic was written this way because it's the simplest thing to do, given that the source partition map and offset map in the source records are opaque, meaning we can't sort them and have to instead rely upon the order of the source records returned by the connector. And because the producer can make progress writing to some topic partitions while not making progress on others, it's possible that some records in a batch are written before earlier records in the same batch. The bottom line is that we have to track offsets that can be committed using only the order of the records that were generated by the source task.

The current logic simply blocks committing offsets until each "batch" of records is completely flushed. That way we can commit all of the offsets in the batch together, and let the offset writer rely upon ordering to use only the latest offset map for each partition map when we tell it to flush. But flushing offsets requires synchronization, and the current logic switches between the two outstanding-message collections while a flush is in progress.

@hachikuji wrote:
That's my understanding, too. And maybe I don't grasp the subtleties of the fix, but it seems like the fix won't necessarily help when a producer is consistently slow. In such cases, the backlog of outstanding records can still keep growing faster than the producer can drain it.

Fortunately, we do have back pressure to not let this get too out of control: when the producer's buffer fills up, the worker source task's thread will block (up to `max.block.ms`) until space frees up in the buffer.

But I think we can change how offsets are flushed such that we don't have to wait for the producer, and instead we can simply flush the latest offsets for records that have been successfully written at that point. We just need a different mechanism (other than the two "outstanding" lists and the flush-related flags) to track the offsets for the most recently written records. One way to do that is to use a single concurrent queue that bookkeeps records in the same order as generated by the source task, but in a way that allows us to track which records have been acked and tolerates those records being acked in any order.

For example, we could replace the two lists with such a queue. An element is appended to this queue just before the record is sent to the producer, and the producer callback marks that element as acked. This effectively replaces the two "outstanding" lists and the flush-related flags.

Then here's the big change: when committing offsets, we dequeue elements from the head of the queue as long as they have been acked, hand their offsets to the offset writer, and flush only those.

I've described this using a non-blocking queue of unlimited size. I think we could do this because the existing WorkerSourceTask logic already handles the possibility that the producer's buffer fills up and `send` blocks. Alternatively, we could use a blocking queue, but this would require an additional worker configuration, which is not ideal and can't be backported.

@C0urante, WDYT?
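To make the idea concrete, here's a rough sketch of such a queue. This is my own illustration under the assumptions above; the class and method names (e.g. `SubmittedRecordsSketch`, `committableOffsets`) are hypothetical, not the eventual Connect implementation:

```java
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedDeque;

// Rough sketch of the single-queue idea: records are enqueued in the order
// the task produced them, acked in any order, and only the acked prefix is
// dequeued at commit time.
public class SubmittedRecordsSketch {
    static class SubmittedRecord {
        final Map<String, Object> sourcePartition;
        final Map<String, Object> sourceOffset;
        volatile boolean acked;

        SubmittedRecord(Map<String, Object> partition, Map<String, Object> offset) {
            this.sourcePartition = partition;
            this.sourceOffset = offset;
        }
    }

    private final Deque<SubmittedRecord> records = new ConcurrentLinkedDeque<>();

    // Called just before handing the record to the producer.
    SubmittedRecord submit(Map<String, Object> partition, Map<String, Object> offset) {
        SubmittedRecord record = new SubmittedRecord(partition, offset);
        records.add(record);
        return record;
    }

    // Called from the producer callback once the record is acked.
    void ack(SubmittedRecord record) {
        record.acked = true;
    }

    // Dequeue the acked prefix and return the latest offset per source
    // partition; the first unacked record (and everything after it) stays
    // queued for the next attempt.
    Map<Map<String, Object>, Map<String, Object>> committableOffsets() {
        Map<Map<String, Object>, Map<String, Object>> offsets = new LinkedHashMap<>();
        for (SubmittedRecord head = records.peek(); head != null && head.acked; head = records.peek()) {
            records.poll();
            offsets.put(head.sourcePartition, head.sourceOffset);
        }
        return offsets;
    }
}
```

An unacked record parks everything behind it until the next attempt, but already-acked progress ahead of it is committed immediately instead of failing the whole flush.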
---
My previous suggestion simply dequeues all completed records until an unacked record is found. This is really straightforward, but we could try to do better. We could dequeue all records except those that have a source partition that has not been acked.

For example, let's say we have enqueued 10 records, all of which have been acked except records 3 and 6. This might happen if records 4, 5, and 7-10 were written to different topic partitions than records 1, 2, 3, and 6, and the producer is stuck on the latter partitions. With the simplistic logic, we'd only dequeue records 1 and 2, we'd add the offsets for these two records to the offset writer, and we'd flush only those. There are quite a few other acked records whose offsets we could not commit. However, if we dequeue all acked records with a source partition map that does not match a previously un-acked record, then we'd be able to dequeue more records and also flush offsets for records 4, 5, and 7-10 (assuming their source partitions differ from those of the unacked records).

This minor change will dramatically improve the ability to commit offsets closer to what has actually been acked.
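A sketch of this refinement (again my own illustration with hypothetical names) keeps one deque per source partition, so a stuck record only blocks offset commits for records sharing its source partition:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Per-source-partition queues: iteration stops at the first unacked record
// within each partition's own queue, not globally.
public class PerPartitionQueues {
    static class Entry {
        final Map<String, Object> offset;
        boolean acked;

        Entry(Map<String, Object> offset) {
            this.offset = offset;
        }
    }

    private final Map<Map<String, Object>, Deque<Entry>> queues = new LinkedHashMap<>();

    synchronized Entry submit(Map<String, Object> partition, Map<String, Object> offset) {
        Entry e = new Entry(offset);
        queues.computeIfAbsent(partition, k -> new ArrayDeque<>()).add(e);
        return e;
    }

    synchronized void ack(Entry e) {
        e.acked = true;
    }

    // For each partition, drain the acked prefix of its own queue and keep
    // the latest offset; unacked records only block their own partition.
    synchronized Map<Map<String, Object>, Map<String, Object>> committableOffsets() {
        Map<Map<String, Object>, Map<String, Object>> offsets = new LinkedHashMap<>();
        for (Map.Entry<Map<String, Object>, Deque<Entry>> partitionQueue : queues.entrySet()) {
            Deque<Entry> deque = partitionQueue.getValue();
            while (!deque.isEmpty() && deque.peek().acked) {
                offsets.put(partitionQueue.getKey(), deque.poll().offset);
            }
        }
        return offsets;
    }
}
```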
---
Also, with my proposed approach, the two "outstanding" lists and the flush-related flags would no longer be needed.
---
@rhauch Overall that looks good to me. It's an elegant solution to the tricky problem you noted about the opacity of task-provided source offsets w/r/t ordering. I'm a little worried about offset commits taking longer and longer with the more sophisticated approach you proposed (where we would unconditionally iterate over every record in the batch, instead of only until the first unacknowledged record). It's true that there would be natural back pressure from the producer as its buffer fills up, which bounds how many records can be queued at once, but iterating over that many records on every commit attempt could still take a while. If this is a valid concern and we'd like to take it into account for now, I can think of a couple ways to handle it off the top of my head:
I think option 3 may be warranted, although it's still possible that offset commits take a long time if 32MB worth of records end up getting queued. Option 2 may be worth implementing or at least considering as a follow-up item to handle this case. Thoughts?
---
@C0urante, thanks for the feedback on my suggestion. I like your option 3, because it does allow the iteration to stop on each source partition as soon as it encounters the first unacknowledged record in each queue. I also think that the behavior with the suggested approach and your option 3 is still a lot better than the current situation.

One question, though: you mention that it might be a problem if iterating over the submitted records takes longer than `offset.flush.timeout.ms`, but would that really happen given that the iteration doesn't block on the producer?

Of course, another option might be to incur the iteration on the worker source task thread. That would essentially move the use of the queue(s) to the worker source task thread, though we still need to get the offsets to the offset commit thread and so would likely have to keep the synchronization blocks around the offset writer snapshot. On one hand, that's putting more work onto the worker source task thread and making the offset thread super straightforward (snapshot and write); on the other it's putting the onus on the worker source task thread. Thoughts?
Agreed 👍
That's mostly correct: we wouldn't be waiting on a blocking operation while iterating through the deque(s), although we might still choose to block on the actual write to the offset topic in the same way that we currently do, just for the sake of metrics and allowing users to monitor the health of the connection between the Connect worker and the offsets topic. Not a huge deal though, and the point that we wouldn't be blocking on the task's producer is still valid. I think the issue is less that we'd end up timing out and more that we'd end up violating the guarantee that's provided right now by the framework that each task gets to take up only `offset.flush.timeout.ms` of the shared offset commit thread's time before the next task gets its turn.
I think this'd be great, especially with the snapshotting logic you mention, which should basically eliminate any blocking between the two threads except to prevent race conditions while simple operations like clearing a hash map or assigning a new value to an instance variable take place.

One thing that gave me pause initially was the realization that we'd be double-iterating over every source record at this point: once to transform, convert, and dispatch the record to the producer, and then once to verify that it had been acknowledged while iterating over the deque it's in. But I can't imagine it'd make a serious difference with CPU utilization given that transformation, conversion, and dispatching to a producer are likely to be at least an order of magnitude more expensive than just checking a boolean flag and possibly inserting the record's offset into a hash map. And memory utilization should be very close to the existing approach, which already tracks every single unacknowledged record in the `outstandingMessages` field.

I think this buys us enough that my earlier-mentioned option 2 (multiple threads for offset commits) isn't called for, since the only blocking operation that would be performed during offset commit at this point is a write to the offsets topic. If the offsets topic is unavailable, it's likely that the impact would be the same across all tasks (unless the task is using a separate offsets topic, which will become possible once the changes for KIP-618 are merged), and even if not, things wouldn't be made any worse than they already are: the offset flush timeout would expire, and the next task in line would get its chance to commit offsets.

@rhauch If this is all agreeable I think we're ready to start implementing. Since you've provided a lot of the code yourself I'm happy to let you take on that work if you'd like; otherwise, I'll get started and see if I can have a new PR with these changes out by early next week.
---
@C0urante wrote:
Sounds good to me! I'm looking forward to your new PR; please link here and ping me. Thanks!
---
@rhauch (and, if interested, @hachikuji) new PR is up: #11323 |
Jira: KAFKA-12226
When a task fails to commit offsets because all outstanding records haven't been ack'd by the broker yet, it's better to retry that same batch. Otherwise, the set of outstanding records can grow indefinitely and all subsequent offset commit attempts can fail. By retrying the same batch, it becomes possible to eventually commit offsets, even when the producer is unable to keep up with the throughput of the records provided to it by the task.
Two unit tests are added to verify this behavior.
Committer Checklist (excluded from commit message)