KAFKA-12487: Add support for cooperative consumer protocol with sink connectors #10563

Merged
kkonstantine merged 6 commits into apache:trunk from C0urante:kafka-12487
Nov 10, 2021

Conversation

@C0urante
Contributor

@C0urante C0urante commented Apr 19, 2021

Jira

Currently, the WorkerSinkTask's consumer rebalance listener (and related logic) is hardcoded to assume eager rebalancing, which means that all partitions are revoked any time a rebalance occurs and then the set of partitions included in onPartitionsAssigned is assumed to be the complete assignment for the task. Not only does this cause failures when the cooperative consumer protocol is used, it fails to take advantage of the benefits provided by that protocol.

These changes alter framework logic to not only not break when the cooperative consumer protocol is used for a sink connector, but to reap the benefits of it as well, by not revoking partitions unnecessarily from tasks just to reopen them immediately after the rebalance has completed.

This change will be necessary in order to support KIP-726, which currently proposes that the default consumer partition assignor be changed to the CooperativeStickyAssignor.

Two integration tests are added to verify sink task behavior with both eager and cooperative consumer protocols, and new unit tests are added and existing ones adapted as well.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@C0urante
Contributor Author

@ncliang @gharris1727 do either of you have time to take a look?

@ableegoldman @showuon @guozhangwang FYI, this should be the only change to Connect necessary for KIP-726.

Contributor

@ncliang ncliang left a comment

Thanks for the PR @C0urante . The overall approach LG. I have just a couple questions.

task.close(topicPartitions);
if (workerErrantRecordReporter != null) {
log.trace("Cancelling reported errors for {}", topicPartitions);
workerErrantRecordReporter.cancelFutures(topicPartitions);
Contributor

I'm not sure if cancelling the outstanding futures for error reporting is the right thing to do here. Would it be reasonable to await their completion for a reasonable amount of time before giving up?

Contributor Author

@C0urante C0urante Apr 29, 2021

In this case, we've lost the partition assignment and won't be able to commit offsets for them, and since offset commits don't take place before outstanding error reports have completed, we know that no offsets for the records that caused these error reports have been committed either. So, we're guaranteed that the records that the task reported errors for will be redelivered, to whichever task now owns these partitions, regardless of whether we wait here for the reporting to complete.

Contributor

I see. So the distinction is because in the revoked case we get a chance to await the errant record reporter futures before we commit offsets for the revoked partitions, but in the lost case we've already lost the partition and don't get a chance to commit offsets. Since we already do not own the partition, we should not be reporting errors for it and should let the current owner take that responsibility. It should be noted that the cancellation is best-effort, so there is a chance we duplicate reporting for the errant record.

// sure anything we paused that the task didn't request to be paused *and* which we still own is resumed.
// Also make sure our tracking of paused partitions is updated to remove any partitions we no longer own.
pausedForRedelivery = false;
pausedForRedelivery = pausedForRedelivery && !messageBatch.isEmpty();
Contributor

I don't know if this change is required. The way I read the current implementation, we make sure that the paused partitions contain only assigned partitions in the block below, setting the paused partitions on context. We then rely on the code block in iteration() to resume partitions that should not be paused.

            } else if (!pausedForRedelivery) {
                resumeAll();
                onResume();
            }

Setting this to anything other than false causes us not to resume partitions which we own that were not explicitly requested to be paused.

Contributor Author

Setting this to anything other than false causes us not to resume partitions which we own that were not explicitly requested to be paused.

Yep, that's intentional. The retry loop for sink tasks works like this:

  • Grab new batch of records from consumer, store in messageBatch
  • Deliver batch to task via SinkTask::put
  • Catch a RetriableException from the task
  • Pause all partitions on the consumer and set pausedForRedelivery to true
  • Redeliver batch (stored in the messageBatch field) to task via SinkTask::put continuously
  • Once a delivery attempt is successful, clear the batch stored in messageBatch and, since pausedForRedelivery is true, resume all partitions on the consumer and reset pausedForRedelivery to false

When consumer rebalances meant a mass-revocation and mass-reassignment of partitions, we could get away with unconditionally resetting pausedForRedelivery to false during them, since offsets for any message batches going through the retry loop would not be committed and those records would be guaranteed to be redelivered after the rebalance. However, with incremental consumer rebalances, a task might not be forced to commit offsets for some (or even all) records in its current batch (if, for example, the new assignment only included additional partitions and no revocations). In that case, we have to be careful to remain paused so that the current batch can continue to be retried.
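This state machine can be sketched with simplified stand-ins. The class below, its String "records" (keyed by partition name), and all method names are hypothetical simplifications, not the actual WorkerSinkTask code:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative stand-in for the redelivery state described above; String values
// keyed by partition name stand in for SinkRecord. All names are hypothetical.
class RedeliverySketch {
    final List<String> messageBatch = new ArrayList<>();
    final Set<String> pausedPartitions = new HashSet<>();
    boolean pausedForRedelivery = false;

    // SinkTask::put threw RetriableException: pause everything, mark for redelivery.
    void onRetriableFailure(Set<String> assignment) {
        pausedForRedelivery = true;
        pausedPartitions.addAll(assignment);
    }

    // Cooperative revocation: drop only the revoked partitions' records, and stay
    // paused only if records from retained partitions still await redelivery.
    void onPartitionsRevoked(Set<String> revoked) {
        messageBatch.removeIf(revoked::contains);
        pausedPartitions.removeAll(revoked);
        pausedForRedelivery = pausedForRedelivery && !messageBatch.isEmpty();
    }

    // A delivery attempt succeeded: clear the batch and resume consumption.
    void onDeliverySuccess() {
        messageBatch.clear();
        pausedPartitions.clear();
        pausedForRedelivery = false;
    }
}
```

The key difference from the eager protocol is in onPartitionsRevoked: the paused-for-redelivery flag survives the rebalance whenever any retained partition still has records pending.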

Contributor Author

I'm actually wondering if there's a bug in the existing code; not sure pausedForRedelivery should be reset to false after a consumer rebalance. If the worker is in the middle of retrying a failed batch delivery, all the partitions will be paused, and after the rebalance, nothing explicitly resumes them.

During normal execution (when the connector isn't in the middle of redelivering a failed batch) this doesn't surface because partition consumption hasn't been paused anyways. But that assumption breaks down if we've paused to retry a failed batch.

I think it might be necessary to throw in a call to resumeAll() if pausedForRedelivery transitions from true to false as a result of a consumer rebalance (or, if we're lazy, we can just throw it in unconditionally at the end of each consumer rebalance if pausedForRedelivery is false since Consumer::resume is a no-op for partitions that weren't already paused).

Contributor Author

Oooooh, one more edge case to consider--if new partitions are assigned while in the middle of retrying a failed batch, right now the worker might accidentally receive new records from those partitions, which would violate the assertion that new records can be received from the consumer if and only if the messageBatch field is empty.

I think we can (and should) handle this by invoking pauseAll() if pausedForRedelivery is still true after removing records from revoked partitions from the current batch.
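A minimal sketch of that handling, with hypothetical names and String partition keys standing in for TopicPartition (not the actual patch):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: if a rebalance assigns new partitions while the task is
// still mid-redelivery, pause the newcomers too, so the consumer cannot return
// new records while messageBatch is non-empty. Names are hypothetical.
class RebalancePauseSketch {
    final Set<String> paused = new HashSet<>();
    boolean pausedForRedelivery = false;

    void onPartitionsAssigned(Set<String> newlyAssigned) {
        if (pausedForRedelivery) {
            // effectively pauseAll() over the enlarged assignment
            paused.addAll(newlyAssigned);
        }
    }
}
```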

Contributor Author

@C0urante C0urante Apr 29, 2021

I've pushed a change that should address this gap and includes a new unit test for the pause/resume logic while paused for redelivery across consumer rebalances.

Contributor

@ncliang ncliang left a comment

LGTM. Thanks for the PR @C0urante . The changes here are complex with many edge cases, but I think you've covered them with sufficient test coverage.


Contributor

@kkonstantine kkonstantine left a comment

Thanks for the PR @C0urante
Left a few comments

log.debug("{} Rewinding topic partition {} to offset {}", this, entry.getKey(), entry.getValue().offset());
consumer.seek(entry.getKey(), entry.getValue().offset());
}
currentOffsets = new HashMap<>(lastCommittedOffsets);
Contributor

this change means we never create a fresh copy of the offsets map.
Is this correct? Is there a risk for the offsets to keep being added and never removed?

Contributor Author

Good catch. I believe this would lead to incorrect metrics given the use of the currentOffsets field when invoking sinkTaskMetricsGroup::assignedOffsets, and may lead to issues with the end-of-life call to closePartitions as well.

I'll add cleanup logic in the consumer rebalance listener.
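That cleanup could look roughly like this; the class and method names are hypothetical, with String keys standing in for TopicPartition:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative cleanup sketch: when partitions are closed, drop their entries
// from the offset-tracking maps rather than recreating the maps wholesale, so
// neither map grows without bound across cooperative rebalances.
class OffsetCleanupSketch {
    final Map<String, Long> currentOffsets = new HashMap<>();
    final Map<String, Long> lastCommittedOffsets = new HashMap<>();

    void onPartitionsClosed(Set<String> closed) {
        currentOffsets.keySet().removeAll(closed);
        lastCommittedOffsets.keySet().removeAll(closed);
    }
}
```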

commitStarted = now;
sinkTaskMetricsGroup.recordOffsetSequenceNumber(commitSeqno);

Map<TopicPartition, OffsetAndMetadata> lastCommittedOffsets = this.lastCommittedOffsets.entrySet().stream()
Contributor

shadowing of a member field this way is not ideal.
For example there's an older usage of this collection below. I assume it's the new local variable we want to use, but I can't be 100% sure.

Should we call this actuallyCommittedOffsets or similar?

Contributor Author

The "actually committed offsets" (as in, the ones that we pass to the consumer) are computed a bit further on in the method. Maybe lastCommittedOffsetsForPartitions? Bit of a mouthful, will try to think of something better.

log.trace("Awaiting all reported errors to be completed");
workerErrantRecordReporter.awaitAllFutures();
log.trace("Completed all reported errors");
log.trace("Awaiting reported errors for {} to be completed", topicPartitions);
Contributor

I wonder if we want to print the actual list of partitions here, which might be long. And do it twice.
I see the same pattern is applied elsewhere. I understand the value of explicit listing.

Contributor Author

That's fair; it may be useful to list once but probably not more than that. I'll remove the set of partitions from these log lines and add a single TRACE-level line at the top of the method that logs the set of partitions; at that level, I think the verbosity is acceptable, especially after this change goes out in case we need to do some debugging of unexpected behavior. LMK what you think, though.

this, partition, taskProvidedOffset, consumer.assignment());
this, partition, taskProvidedOffset, allAssignedTopicPartitions);
} else {
log.debug("{} Ignoring task provided offset {}/{} -- topic partition not requested, requested={}",
Contributor

what's a requested topic partition?
Also, above we mention just partition

Contributor Author

Good catch, s/topic partition/partition/

"Requested" here means that although the partition is assigned to the task, it is not one of the partitions that we are currently committing offsets for.

@Override
public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
log.debug("{} Partitions assigned {}", WorkerSinkTask.this, partitions);
lastCommittedOffsets = new HashMap<>();
Contributor

similar question as above. This is a map that only grows now. Is this correct?

Contributor Author

(Addressed above; it is almost certainly not correct)

// Make sure we don't have any leftover data since offsets will be reset to committed positions
messageBatch.clear();
// Make sure we don't have any leftover data since offsets for these partitions will be reset to committed positions
messageBatch.removeIf(record -> partitions.contains(new TopicPartition(record.topic(), record.kafkaPartition())));
Contributor

do you worry that this became more expensive now, especially in cases where we want to remove everything as before?

Contributor Author

It certainly will be more expensive, although this will only occur in an edge case where the task has thrown a RetriableException from SinkTask::put, then before a follow-up invocation of SinkTask::put is able to succeed, a consumer rebalance takes place and partitions are revoked.

A naive improvement might be to convert messageBatch to a Map<TopicPartition, List<SinkRecord>>, which would allow us to quickly filter out records belonging to a given topic partition. But that would also be less efficient since we'd have to re-join those lists together before delivering records to the task, and would have to populate that map in the first place after retrieving the original list of records from Consumer::poll.

Not sure if there's a good way around this; if you believe it's a blocker I'm happy to spend some more time on it, though.
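The trade-off can be sketched as follows; String[] pairs of {partition, value} stand in for SinkRecord, and all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative comparison of the two batch layouts discussed above.
class BatchLayoutSketch {
    // Current layout: one flat list; dropping a revoked partition is an O(n) scan,
    // but it only runs in the rare retry-plus-rebalance case.
    static void dropFlat(List<String[]> batch, String revoked) {
        batch.removeIf(rec -> rec[0].equals(revoked)); // rec[0] = partition
    }

    // Alternative layout: partition-keyed map; dropping a partition is O(1), but
    // every delivery to SinkTask::put must first re-join the per-partition lists.
    static List<String[]> rejoin(Map<String, List<String[]>> byPartition) {
        List<String[]> joined = new ArrayList<>();
        byPartition.values().forEach(joined::addAll);
        return joined;
    }
}
```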


// Visible for testing
protected final LinkedList<Future<Void>> futures;
protected final Map<TopicPartition, Future<Void>> futures;
Contributor

If we want this to be a concurrent map probably good idea to depict this in the declaration.
(same as you do in the tests below)

Suggested change
protected final Map<TopicPartition, Future<Void>> futures;
protected final ConcurrentMap<TopicPartition, Future<Void>> futures;

Contributor Author

I like this style and agree with the suggestion; however, there's actually a bug in the PR right now where we only store one Future at a time per topic-partition, so will address in a follow-up commit instead of just clicking the Commit suggestion button.
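A sketch of what such a follow-up might look like (hypothetical names, String keys standing in for TopicPartition; not the actual commit):

```java
import java.util.Collection;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Future;

// Illustrative fix for the one-future-per-partition bug: keep every outstanding
// errant-record future per partition so cancelFutures can cancel all of them,
// not just the most recently stored one.
class ReporterFuturesSketch {
    final ConcurrentMap<String, List<Future<Void>>> futures = new ConcurrentHashMap<>();

    void report(String partition, Future<Void> future) {
        futures.computeIfAbsent(partition, p -> new CopyOnWriteArrayList<>()).add(future);
    }

    void cancelFutures(Collection<String> partitions) {
        for (String partition : partitions) {
            List<Future<Void>> pending = futures.remove(partition);
            if (pending != null) {
                pending.forEach(f -> f.cancel(true));
            }
        }
    }
}
```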

@C0urante
Contributor Author

Thanks for taking a look @kkonstantine, this is ready for another round.

@C0urante
Contributor Author

C0urante commented Oct 7, 2021

@kkonstantine Would you mind giving this another pass? It's been over two months.

Contributor

@kkonstantine kkonstantine left a comment

Thanks @C0urante
LGTM.

I reran the tests and there were no relevant failures. Merging to trunk, 3.1 and 3.0

@kkonstantine kkonstantine merged commit 39b1bf4 into apache:trunk Nov 10, 2021
kkonstantine pushed a commit that referenced this pull request Nov 10, 2021
…connectors (#10563)

Currently, the `WorkerSinkTask`'s consumer rebalance listener (and related logic) is hardcoded to assume eager rebalancing, which means that all partitions are revoked any time a rebalance occurs and then the set of partitions included in `onPartitionsAssigned` is assumed to be the complete assignment for the task. Not only does this cause failures when the cooperative consumer protocol is used, it fails to take advantage of the benefits provided by that protocol.

These changes alter framework logic to not only not break when the cooperative consumer protocol is used for a sink connector, but to reap the benefits of it as well, by not revoking partitions unnecessarily from tasks just to reopen them immediately after the rebalance has completed.

This change will be necessary in order to support [KIP-726](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=177048248), which currently proposes that the default consumer partition assignor be changed to the `CooperativeStickyAssignor`.

Two integration tests are added to verify sink task behavior with both eager and cooperative consumer protocols, and new unit tests are added and existing ones adapted as well.

Reviewers: Nigel Liang <nigel@nigelliang.com>, Konstantine Karantasis <k.karantasis@gmail.com>
kkonstantine pushed a commit that referenced this pull request Nov 10, 2021
…connectors (#10563)

stanislavkozlovski added a commit to stanislavkozlovski/kafka that referenced this pull request Nov 11, 2021
…ntegration-11-nov

* ak/trunk: (15 commits)
  KAFKA-13429: ignore bin on new modules (apache#11415)
  KAFKA-12648: introduce TopologyConfig and TaskConfig for topology-level overrides (apache#11272)
  KAFKA-12487: Add support for cooperative consumer protocol with sink connectors (apache#10563)
  MINOR: Log client disconnect events at INFO level (apache#11449)
  MINOR: Remove topic null check from `TopicIdPartition` and adjust constructor order (apache#11403)
  KAFKA-13417; Ensure dynamic reconfigurations set old config properly (apache#11448)
  MINOR: Adding a constant to denote UNKNOWN leader in LeaderAndEpoch (apache#11477)
  KAFKA-10543: Convert KTable joins to new PAPI (apache#11412)
  KAFKA-12226: Commit source task offsets without blocking on batch delivery (apache#11323)
  KAFKA-13396: Allow create topic without partition/replicaFactor (apache#11429)
  ...
@C0urante C0urante deleted the kafka-12487 branch November 12, 2021 19:37
xdgrulez pushed a commit to xdgrulez/kafka that referenced this pull request Dec 22, 2021
…connectors (apache#10563)

lmr3796 pushed a commit to lmr3796/kafka that referenced this pull request Jun 2, 2022
…connectors (apache#10563)
