
KAFKA-12226: Commit source task offsets without blocking on batch delivery#11323

Merged
rhauch merged 15 commits into apache:trunk from C0urante:kafka-12226 on Nov 7, 2021

Conversation

@C0urante
Contributor

@C0urante C0urante commented Sep 13, 2021

Jira

Replaces #10112

Replaces the current batch-based logic for offset commits with a dynamic, non-blocking approach outlined across several discussion threads on #10112.

Essentially, a deque is kept for every source partition that a source task produces records for, and each element in that deque is a SubmittedRecord with a flag to track whether the producer has ack'd the delivery of that source record to Kafka yet. Periodically, the worker (on the same thread that polls the source task for records and transforms, converts, and dispatches them to the producer) polls acknowledged elements from the beginning of each of these deques and collects the latest offsets from these elements, storing them in a snapshot that is then committed on the separate source task offset thread.

The behavior of the offset.flush.timeout.ms property is retained, but essentially now only applies to the actual writing of offset data to the internal offsets topic (if running in distributed mode) or the offsets file (if running in standalone mode). No time is spent during WorkerSourceTask::commitOffsets blocking on the acknowledgment of records by the producer.

It's possible that memory exhaustion may occur if, for example, a single Kafka partition is offline for an extended period. In cases like this, the collection of deques in the SubmittedRecords class may continue to grow indefinitely until the partition comes back online and the SubmittedRecords in those deques that targeted the formerly-offline Kafka partition are acknowledged and can be removed. Although this may be suboptimal, it is no worse than the existing behavior of the framework in these cases.
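The deque-per-source-partition mechanism described above can be sketched as follows. This is a simplified illustration, not the actual Connect `SubmittedRecords` class; the class and method names here are partly hypothetical, and the real implementation handles concurrency and error cases this sketch omits.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified sketch: one deque of in-flight records per source partition.
class SubmittedRecordsSketch {
    static class SubmittedRecord {
        final Map<String, Object> partition;
        final Map<String, Object> offset;
        volatile boolean acked = false;
        SubmittedRecord(Map<String, Object> partition, Map<String, Object> offset) {
            this.partition = partition;
            this.offset = offset;
        }
        void ack() { acked = true; } // invoked from the producer callback
    }

    private final Map<Map<String, Object>, Deque<SubmittedRecord>> records = new LinkedHashMap<>();

    // Called on the task thread when a record is dispatched to the producer.
    SubmittedRecord submit(Map<String, Object> partition, Map<String, Object> offset) {
        SubmittedRecord record = new SubmittedRecord(partition, offset);
        records.computeIfAbsent(partition, k -> new ArrayDeque<>()).add(record);
        return record;
    }

    // Poll acked records from the head of each deque; the last record polled
    // per partition carries the latest committable offset for that partition.
    Map<Map<String, Object>, Map<String, Object>> committableOffsets() {
        Map<Map<String, Object>, Map<String, Object>> result = new HashMap<>();
        records.forEach((partition, deque) -> {
            Map<String, Object> offset = null;
            while (!deque.isEmpty() && deque.peek().acked) {
                offset = deque.poll().offset;
            }
            if (offset != null) {
                result.put(partition, offset);
            }
        });
        // Drop empty deques so the map does not grow indefinitely.
        records.values().removeIf(Deque::isEmpty);
        return result;
    }
}
```

Note how an unacked record at the head of a deque blocks commits for its source partition only, which is also why a long-offline Kafka partition can cause a deque to grow, as the description notes.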

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@C0urante
Contributor Author

CC @rhauch; hopefully this is fairly close to what you had in mind.

Contributor

@rhauch rhauch left a comment


Very nice job, @C0urante. Overall this is exactly along the lines of what I was suggesting, and I even like some improvements you made, like calling updateCommittableOffsets() from the WorkerSourceTask.execute() method.

Most of my comments are minor nits to clarify/improve phrasing in JavaDoc, or to expand the unit tests a bit. One comment/question is about saving some effort when there are no offsets.

Otherwise, looks great and almost ready to merge.

C0urante and others added 6 commits October 1, 2021 12:28
…/SubmittedRecords.java

Co-authored-by: Randall Hauch <rhauch@gmail.com>
…/SubmittedRecords.java

Co-authored-by: Randall Hauch <rhauch@gmail.com>
…/SubmittedRecords.java

Co-authored-by: Randall Hauch <rhauch@gmail.com>
…/SubmittedRecords.java

Co-authored-by: Randall Hauch <rhauch@gmail.com>
…/SubmittedRecords.java

Co-authored-by: Randall Hauch <rhauch@gmail.com>
@C0urante
Contributor Author

C0urante commented Oct 1, 2021

Thanks @rhauch, I've addressed all of the comments that seemed straightforward and left a response on the ones where a bit of discussion seems warranted before making changes. This is ready for another round.

Contributor

@rhauch rhauch left a comment


Thanks, @C0urante. This looks really good. I have a few more minor suggestions and a few questions.

}

maybeThrowProducerSendException();
updateCommittableOffsets();
Contributor


Actually, I now have a question: why did you choose to add it before the poll() (a few lines down) rather than after, perhaps after the if (!sendRecords()) {...} block below?

The reason I ask is that if one loop of the while polls for records and sends them (where they are sent to the producer and asynchronously acked), but then the connector is paused at about the same time, the offsets for those records will not be committed until after the connector is resumed. Is that intentional?

C0urante and others added 2 commits October 11, 2021 14:05
…/SubmittedRecords.java

Co-authored-by: Randall Hauch <rhauch@gmail.com>
@C0urante
Contributor Author

Thanks @rhauch. Ready for another round when you have time.

Contributor

@rhauch rhauch left a comment


Thanks, @C0urante. Just a few questions below.

Comment on lines +251 to +253

updateCommittableOffsets();

Contributor


Sorry, maybe I wasn't clear in my [previous comment about this call](https://github.com/apache/kafka/pull/11323#discussion_r724437761). I think there is an edge case here that we could deal with a bit better. Consider the following scenario as we walk through the loop in execute(). The WorkerSourceTask is not paused, and has been sending and committing offsets for records.

On some pass through the execute() while loop:

  1. shouldPause() returns false
  2. maybeThrowProducerSendException() does nothing since no exception was set from the producer
  3. poll() is called to get new records from the source task;
  4. updateCommittableOffsets() is called to update the committableOffsets map for any records sent in previous loops that have been acked
  5. sendRecords() is called with the records retrieved in step 3 earlier in this same pass, which for each of these new records enqueues a SubmittedRecord and calls producer.send(...) on each record with a callback that acks the submitted record.

But just after step 1 in the aforementioned pass, the connector and its tasks are paused. This means that the next pass through the WorkerSourceTask.execute() while loop:

  1. shouldPause() returns true, so
  2. onPause() is called and awaitUnpause() is called.

At that point, the thread blocks. But the records that were sent to the producer in step 5 of the previous pass may have already been acked, meaning we could have updated the offsets just before we paused. That might not have been enough time for all of the records submitted in that step to be acked, but if we were to move the updateCommittableOffsets() call to just before the if (shouldPause()) check, then we would get the offsets for as many acked records as possible just before the thread pauses.

In all other non-paused scenarios, I'm not sure it matters where in this loop we call updateCommittableOffsets(). But for the just-been-paused scenario, moving it to the first (or last) operation in the loop gives us a bit more of a chance to commit the offsets for as many acked records as possible.

WDYT?

Contributor Author


Ugh, sorry. Your initial point was very clear, although I really appreciate the detailed writeup here. It was an implementation snafu. I wanted to handle the case where poll produced no records, which meant invoking updateCommittableOffsets before the if (toSend == null) continue; section. Of course, that didn't actually address the original concern, which is that we may miss a chance to update offsets for records just-dispatched to the producer in sendRecords.

I like the idea of placing updateCommittableOffsets right before the if (shouldPause()) check, at the top of the loop; will do.
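The resulting loop ordering can be sketched as a toy simulation. The method names mirror those discussed for WorkerSourceTask.execute(), but the bodies here are stubs that only record the call order; this is an illustration of the control flow agreed on above, not Connect code.

```java
import java.util.ArrayList;
import java.util.List;

// Stub loop: updateCommittableOffsets() runs first on every pass, so
// offsets for already-acked records are captured even on the pass where
// the task discovers it should pause.
class ExecuteLoopSketch {
    final List<String> calls = new ArrayList<>();
    int iterations = 0;

    boolean shouldPause() { calls.add("shouldPause"); return iterations > 1; }
    void updateCommittableOffsets() { calls.add("updateCommittableOffsets"); }
    void maybeThrowProducerSendException() { calls.add("maybeThrowProducerSendException"); }
    void pollAndSend() { calls.add("pollAndSend"); }

    void execute() {
        while (iterations++ < 3) {
            updateCommittableOffsets();   // capture offsets of acked records first
            if (shouldPause()) {
                break;                    // stand-in for onPause()/awaitUnpause()
            }
            maybeThrowProducerSendException();
            pollAndSend();
        }
    }
}
```

Running it shows that on the pass where shouldPause() returns true, updateCommittableOffsets() has already run, which is the just-been-paused scenario from the review comment.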

@@ -378,7 +370,7 @@ private boolean sendRecords() {
log.trace("{} Failed record: {}", WorkerSourceTask.this, preTransformRecord);
producerSendException.compareAndSet(null, e);
Contributor


We're not calling submittedRecords.removeLastOccurrence(submittedRecord) here. Were you thinking that we're setting the producerSendException, which will cause the execute() method to throw this same exception on the next pass and consequently fail the task?

I think that's the right choice and no changes are required, but I do need to work through it. So pardon my thought process here.

The question is: what happens to records (and SubmittedRecord objects and their offsets) that appear after the record that resulted in the asynchronous exception?

What happens depends on what the producer behavior is, or might be in the future. IIRC the exceptions will often be unrecoverable, but it is possible that records could be sent successfully even if they were submitted to the producer after the record that failed, especially when those records were sent to a different topic partition and were actually sent by the producer before the record that failed. After all, from the producer.send() JavaDoc:

Callbacks for records being sent to the same partition are guaranteed to execute in order.

Unfortunately, we cannot infer a relationship between the topic partition for a record and its source partition. So any subsequent records that were sent to a different topic partition could still have the same source partition, and thus they should be enqueued into the same deque. Those offsets would not be committed, since their SubmittedRecord instances are after the SubmittedRecord for the record that failed to send, and the latter would never be acked (as its send failed).

But if any subsequent records were sent to a different topic partition but had a different source partition, their SubmittedRecord instances would be in a different deque than the SubmittedRecord for the record that failed to send, and their offsets could potentially be committed.

If the committed offsets were moved as suggested in a separate thread above, we'd actually get a chance to commit offsets for acked source records before failing the task. It's not super essential, but it'd be good to commit the offsets for as many of those submitted-and-acked records as possible.
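The deque semantics in the scenario above can be reduced to a tiny illustration (not Connect code): each deque holds per-record ack flags for one source partition, oldest first, and a failed record that is never acked blocks commits for its own source partition only. The helper name echoes the canCommitHead check quoted later in the review.

```java
import java.util.Deque;

// Toy model: a deque of ack flags per source partition, oldest record at
// the head. Offsets for a partition can only advance while the head is acked.
class DequeBlockingSketch {
    static boolean canCommitHead(Deque<Boolean> ackedFlags) {
        return !ackedFlags.isEmpty() && ackedFlags.peek();
    }
}
```

A failed record at the head of one partition's deque leaves other partitions' deques committable, which is why subsequent records with a different source partition can still have their offsets committed.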

Contributor Author


So any subsequent records that were sent to a different topic partition could still have the same source partition, and thus they should be enqueued into the same deque. Those offsets would not be committed, since their SubmittedRecord instances are after the SubmittedRecord for the record that failed to send, and the latter would never be acked (as its send failed).

I think this is the "vital" section and it provides a good rationale for why we intentionally keep the failed record in the queue.

If the committed offsets were moved as suggested in a separate thread above, we'd actually get a chance to commit offsets for acked source records before failing the task. It's not super essential, but it'd be good to commit the offsets for as many of those submitted-and-acked records as possible.

We call commitOffsets in a finally block for execute right now. I think we can address this case by adding another call to updateCommittableOffsets right before this end-of-life call to commitOffsets. I've done this; LMKWYT.

C0urante and others added 2 commits October 13, 2021 09:15
…/SubmittedRecords.java

Co-authored-by: Randall Hauch <rhauch@gmail.com>
Contributor

@kkonstantine kkonstantine left a comment


This is a nice improvement in a part of the code that seemed to really need some modernization. I have one comment regarding the use of multiple deques vs single deque.

Comment on lines +104 to +115
public Map<Map<String, Object>, Map<String, Object>> committableOffsets() {
    Map<Map<String, Object>, Map<String, Object>> result = new HashMap<>();
    records.forEach((partition, queuedRecords) -> {
        if (canCommitHead(queuedRecords)) {
            Map<String, Object> offset = committableOffset(queuedRecords);
            result.put(partition, offset);
        }
    });
    // Clear out all empty deques from the map to keep it from growing indefinitely
    records.values().removeIf(Deque::isEmpty);
    return result;
}
Contributor


@C0urante, right now we have no visibility into the number or size of the deques. We can't add a metric without a KIP, but WDYT about adding some DEBUG and/or TRACE log messages here? The benefit of doing it here rather than in the WorkerSourceTask is that it would be much easier to enable DEBUG or TRACE for only these log messages. One disadvantage is that this committableOffsets() method is called once per iteration in the WorkerSourceTask.execute() method.

I guess an alternative might be to add a method (e.g., toString()?) that outputs this information, and then put the log messages in WorkerSinkTask.commitOffsets().

Thoughts?

Contributor Author


I agree with your concerns about excess logging if a message is added to the WorkerSourceTask::execute loop.

Since we're removing this log message in this PR, I wonder if we can replace it with something similar? I think users may want to know how many total pending (i.e., unacked) messages there are, how many deques there are, and the number of messages in the largest deque (which may be useful for identifying "stuck" topic partitions).

I'll take a shot at this; LMKWYT.
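The statistics discussed here (total pending messages, number of deques, and the largest deque) can be computed in a single pass over the deques. This sketch is illustrative only; the accessor names in the log message quoted later in the thread (totalPendingMessages(), numDeques(), and so on) suggest a similar shape, but this class and its signature are assumptions.

```java
import java.util.Deque;
import java.util.Map;

// One-pass summary of per-source-partition deques for diagnostic logging.
class PendingMetadataSketch {
    static String summarize(Map<String, Deque<String>> deques) {
        int total = 0;
        String largestPartition = null;
        int largestSize = 0;
        for (Map.Entry<String, Deque<String>> entry : deques.entrySet()) {
            int size = entry.getValue().size();
            total += size;                  // total unacked messages
            if (size > largestSize) {       // track the "stuck" candidate
                largestSize = size;
                largestPartition = entry.getKey();
            }
        }
        return total + " pending messages across " + deques.size()
                + " source partitions; largest is " + largestPartition
                + " with " + largestSize;
    }
}
```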

Contributor

@rhauch rhauch left a comment


Good idea with making Pending be a memento of the offset, and with calling out the existing log message that we should replace. A few suggestions below to hopefully simplify things even more.

Comment on lines +491 to +499
log.info("There are currently {} pending messages spread across {} source partitions whose offsets will not be committed. "
+ "The source partition with the most pending messages is {}, with {} pending messages",
pendingMetadataForCommit.totalPendingMessages(),
pendingMetadataForCommit.numDeques(),
pendingMetadataForCommit.largestDequePartition(),
pendingMetadataForCommit.largestDequeSize()
);
} else {
log.info("There are currently no pending messages for this offset commit; all messages since the last commit have been acknowledged");
Contributor


As you point out, the old log message was:

 log.info("{} flushing {} outstanding messages for offset commit", this, outstandingMessages.size());

This log message had two things it'd be nice to keep:

  1. this as the context; and
  2. the number of records whose offsets were being committed (e.g., the number of acked records).

I think both would be good to include, especially if we're saying the number of records whose offsets are not being committed (yet).

The Pending class seems pretty useful, but computing the number of acked records is not possible here. WDYT about merging the SubmittedRecords.committableOffsets() and pending() methods, by having the former return an object that contains the offset map and the metadata that can be used for logging? This class would be like Pending, though maybe CommittableOffsets is a more apt name. Plus, WorkerSourceTask would only have one volatile field that is updated atomically.

Contributor Author

@C0urante C0urante Nov 2, 2021


👍   SGTM. I've updated the PR accordingly.

One nit: the "flushing outstanding messages for offset commit" message actually refers to the number of unacked messages in the current batch, not the number of acknowledged messages whose offsets will be committed. This has tripped up many of my colleagues, who see "flushing 0 outstanding messages" and think their source connector isn't producing any data, when all it really means is that its producers are keeping up with the throughput of its tasks very well.

I think both pieces of information (number of acked and unacked messages) are useful here so I've included both in the latest draft.

Contributor

@rhauch rhauch left a comment


Thanks, @C0urante, for the recent improvements to logging, and for this PR. Everything else looks great.

@rhauch rhauch merged commit c1bdfa1 into apache:trunk Nov 7, 2021
@C0urante C0urante deleted the kafka-12226 branch November 9, 2021 20:58
stanislavkozlovski added a commit to stanislavkozlovski/kafka that referenced this pull request Nov 11, 2021
…ntegration-11-nov

* ak/trunk: (15 commits)
  KAFKA-13429: ignore bin on new modules (apache#11415)
  KAFKA-12648: introduce TopologyConfig and TaskConfig for topology-level overrides (apache#11272)
  KAFKA-12487: Add support for cooperative consumer protocol with sink connectors (apache#10563)
  MINOR: Log client disconnect events at INFO level (apache#11449)
  MINOR: Remove topic null check from `TopicIdPartition` and adjust constructor order (apache#11403)
  KAFKA-13417; Ensure dynamic reconfigurations set old config properly (apache#11448)
  MINOR: Adding a constant to denote UNKNOWN leader in LeaderAndEpoch (apache#11477)
  KAFKA-10543: Convert KTable joins to new PAPI (apache#11412)
  KAFKA-12226: Commit source task offsets without blocking on batch delivery (apache#11323)
  KAFKA-13396: Allow create topic without partition/replicaFactor (apache#11429)
  ...
rhauch pushed a commit that referenced this pull request Nov 15, 2021
…ivery (#11323)

Replaces the current logic for committing source offsets, which is batch-based and blocks until the entirety of the current batch is fully written to and acknowledged by the broker, with a new non-blocking approach that commits source offsets for source records that have been "fully written" by the producer. The new logic considers a record fully written only if that source record and all records before it with the same source partition have all been written to Kafka and acknowledged.

This new logic uses a deque for every source partition that a source task produces records for. Each element in that deque is a SubmittedRecord with a flag to track whether the producer has ack'd the delivery of that source record to Kafka. Periodically, the worker (on the same thread that polls the source task for records and transforms, converts, and dispatches them to the producer) polls acknowledged elements from the beginning of each of these deques and collects the latest offsets from these elements, storing them in a snapshot that is then committed on the separate source task offset thread.

The behavior of the `offset.flush.timeout.ms` property is retained, but essentially now only applies to the actual writing of offset data to the internal offsets topic (if running in distributed mode) or the offsets file (if running in standalone mode). No time is spent during `WorkerSourceTask::commitOffsets` waiting on the acknowledgment of records by the producer.

This behavior also does not change how the records are dispatched to the producer nor how the producer sends or batches those records.

It's possible that memory exhaustion may occur if, for example, a single Kafka partition is offline for an extended period. In cases like this, the collection of deques in the SubmittedRecords class may continue to grow indefinitely until the partition comes back online and the SubmittedRecords in those deques that targeted the formerly-offline Kafka partition are acknowledged and can be removed. Although this may be suboptimal, it is no worse than the existing behavior of the framework in these cases.

Author: Chris Egerton <chrise@confluent.io>
Reviewed: Randall Hauch <rhauch@gmail.com>
xdgrulez pushed a commit to xdgrulez/kafka that referenced this pull request Dec 22, 2021
…ivery (apache#11323)

lmr3796 pushed a commit to lmr3796/kafka that referenced this pull request Jun 2, 2022
…ivery (apache#11323)
