KAFKA-9076: support consumer offset sync across clusters in MM 2.0 #7577
mimaison merged 1 commit into apache:trunk from
Conversation
We can use the new reset offsets API for this: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=97551484
the new reset offsets API looks awesome, I will make this change in the next iteration
We should also filter with the configured GroupFilter so that only whitelisted groups are synced this way. As written, anything with a checkpoint is synced, which is problematic when e.g. a user blacklists a group: we should stop syncing the group even though a previous checkpoint may exist.
I think once we move the "consumer offset sync" to MirrorCheckpointTask, the groups in the blacklist will be filtered out by the existing logic in MirrorCheckpointTask
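The whitelist/blacklist filtering discussed here can be pictured as a simple predicate applied to each group name before any sync happens. A minimal self-contained sketch (the class and method names are hypothetical, not the actual MM2 GroupFilter API):

```java
import java.util.List;
import java.util.function.Predicate;

public class GroupFilterSketch {
    // Hypothetical filter: a group is synced only if it matches a whitelist
    // pattern and matches no blacklist pattern.
    static Predicate<String> groupFilter(List<String> whitelist, List<String> blacklist) {
        return group -> whitelist.stream().anyMatch(group::matches)
                && blacklist.stream().noneMatch(group::matches);
    }

    public static void main(String[] args) {
        Predicate<String> filter = groupFilter(List.of(".*"), List.of("internal-.*"));
        System.out.println(filter.test("orders-consumer"));  // whitelisted, not blacklisted
        System.out.println(filter.test("internal-metrics")); // blacklisted, so excluded
    }
}
```

With such a predicate applied inside MirrorCheckpointTask, a group that gets blacklisted later simply stops matching, even if an earlier checkpoint for it exists.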
This is a new public method, so please mention in the KIP.
elsewhere in MM2 we refer to "groups" but not "consumers". Should this be sync.group.offsets?
Can we use the OffsetSyncStore class to do offset translation instead of reading back the checkpoint? The MirrorCheckpointTask class is generating the checkpoint -- why not sync the offsets at the same time?
thanks, it indeed saves effort; I will attempt the consumer offset sync in MirrorCheckpointTask in the next iteration
Should we split groups across Tasks instead of syncing all groups in a Connector? For example, MirrorCheckpointConnector divides all groups among MirrorCheckpointTasks, and each Task generates checkpoints for its assigned groups only. This is potentially more scalable.
thanks, that sounds like a scalable approach; I will attempt this change in the next iteration
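The division of groups among tasks suggested above can be sketched as a round-robin assignment, similar in spirit to how Connect's ConnectorUtils.groupPartitions() splits work; this stand-alone version (class and method names are illustrative, not the actual MM2 code) shows the idea:

```java
import java.util.ArrayList;
import java.util.List;

public class GroupAssignmentSketch {
    // Divide consumer groups round-robin across at most maxTasks tasks,
    // so each Task checkpoints only its assigned groups.
    static List<List<String>> assignGroups(List<String> groups, int maxTasks) {
        int numTasks = Math.min(maxTasks, Math.max(groups.size(), 1));
        List<List<String>> assignments = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            assignments.add(new ArrayList<>());
        }
        for (int i = 0; i < groups.size(); i++) {
            assignments.get(i % numTasks).add(groups.get(i));
        }
        return assignments;
    }

    public static void main(String[] args) {
        System.out.println(assignGroups(List.of("g1", "g2", "g3"), 2));
        // → [[g1, g3], [g2]]
    }
}
```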
b6858b2 to 4d0477b
@ryannedolan revised the PR; please take another pass of review when you have time. Thanks
ryannedolan
left a comment
This looks close, but I don't think we should be sending these syncs in the poll() loop. Instead, let's set up a scheduled task like in the Connectors, with a configurable interval.
This would be a side-effect in an otherwise pure function.
@ryannedolan do you suggest a new connector, like "GroupOffsetSyncConnector", or an existing Connector? Also, do you suggest doing the actual consumer offset sync job as a scheduled task in the Connector (like

I was thinking we'd add a Scheduler to MirrorCheckpointTask (not the Connector) and periodically write offsets from there. MirrorCheckpointTask already has an OffsetSyncStore, which lets you translate offsets for any group assigned to the Task. So you just need to loop through the assigned groups, translate their offsets, and write them downstream. This is very close to what you have now -- just that we should do it periodically (in a Scheduler), not as part of the poll() loop. I don't think we should create a new Connector or Task, since all the info we need (OffsetSyncStore, group assignment) is already in MirrorCheckpointTask.
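The shape being suggested -- periodic translation outside the poll() loop -- can be sketched with a plain ScheduledExecutorService. In this self-contained illustration, a simple function stands in for the OffsetSyncStore translation, and plain maps stand in for Kafka's TopicPartition/OffsetAndMetadata types; none of this is the actual MM2 code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongUnaryOperator;

public class PeriodicOffsetSyncSketch {
    // Hypothetical stand-in for OffsetSyncStore: maps each upstream offset
    // to its downstream equivalent via the supplied translation function.
    static Map<String, Long> translateOffsets(Map<String, Long> upstreamOffsets,
                                              LongUnaryOperator translate) {
        Map<String, Long> downstream = new HashMap<>();
        upstreamOffsets.forEach((tp, offset) ->
                downstream.put(tp, translate.applyAsLong(offset)));
        return downstream;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Long> upstream = Map.of("topic-0", 100L, "topic-1", 42L);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Periodically translate and "commit" downstream, outside the poll() loop.
        scheduler.scheduleAtFixedRate(
                () -> System.out.println(translateOffsets(upstream, o -> o - 10)),
                0, 60, TimeUnit.SECONDS);
        Thread.sleep(100);
        scheduler.shutdown();
    }
}
```

The key design point is that translation stays a pure function; only the scheduled runnable has the side effect of writing downstream, which addresses the earlier comment about side effects in an otherwise pure function.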
507648a to 5f5694a
@ryannedolan thanks for your valuable and concrete feedback. I made the changes you suggested; please take another review when you have time. Thanks
ryannedolan
left a comment
I think you are missing a close()/shutdown(), but otherwise lgtm.
Nit: the scheduler understands negative intervals as "disabled", so you don't need to have this extra check here. Just have syncGroupOffsetInterval return -1 if it's disabled.
removed the if condition
Why skip the entire checkpoint if a single downstream partition is ahead of the checkpoint? I guess that's the safest approach -- but are we sure we can't just skip that partition and write the rest?
but are we sure we can't just skip that partition and write the rest?

the latest version skips such partitions and writes the rest
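The per-partition skip agreed on here can be sketched as a pure filter: sync a translated offset only when it is ahead of the consumer's current downstream offset, rather than discarding the whole checkpoint. Plain maps stand in for Kafka's TopicPartition/OffsetAndMetadata types, and the method name is illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class OffsetSyncFilterSketch {
    // Keep only the partitions whose translated (upstream) offset is ahead of
    // the group's current downstream offset; skip the rest, partition by partition.
    static Map<String, Long> offsetsToSync(Map<String, Long> translated,
                                           Map<String, Long> currentDownstream) {
        Map<String, Long> toSync = new HashMap<>();
        translated.forEach((partition, offset) -> {
            long current = currentDownstream.getOrDefault(partition, -1L);
            if (offset > current) { // downstream already at or past this offset? skip it
                toSync.put(partition, offset);
            }
        });
        return toSync;
    }

    public static void main(String[] args) {
        Map<String, Long> translated = Map.of("t-0", 50L, "t-1", 20L);
        Map<String, Long> current = Map.of("t-0", 10L, "t-1", 30L);
        System.out.println(offsetsToSync(translated, current)); // only t-0 is synced
    }
}
```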
@ryannedolan thanks for another round of review feedback; I updated the PR based on your latest comments
8d962d7 to ae1ce2f
This part is duplicated -- can we make it DRYer? I suggest adding a checkpointsForGroup() method and using that in both places.
It might be slightly more efficient to describe all consumerGroups once (outside the for-loop) and then use the resulting map here.
cce6116 to 7cd2419
Can I upvote so this ticket gets prioritized :)
@nils-getdreams sounds like you may be interested in this feature. @ryannedolan may be testing it in house. How about cherry-picking this into your fork and letting me know any feedback based on your tests?
@ning2008wisc @ryannedolan with this automated consumer offset sync, I was wondering if the user can also set an option to ensure the mirrored topics in the target cluster are created without the source cluster prefix.
@amuraru thanks for your comments. To directly answer your first question, I think currently no, meaning the source cluster prefix has to be added. The reason is detailed here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0#KIP-382:MirrorMaker2.0-Cycledetection The consumer will still be required to subscribe to
I would like to propose the following change to take care of the source consumer group changes
Existing:

    for (Entry<TopicPartition, OffsetAndMetadata> entry : group.getValue()) {
        long latestDownstreamOffset = entry.getValue().offset();
        TopicPartition topicPartition = entry.getKey();
        if (!convertedUpstreamOffset.containsKey(topicPartition)) {
            log.trace("convertedUpstreamOffset does not contain TopicPartition: {}", topicPartition.toString());
            continue;
        }
        // if translated offset from upstream is smaller than the current consumer offset
        // in the target, skip updating the offset for that partition
        long convertedOffset = convertedUpstreamOffset.get(topicPartition).offset();
        if (latestDownstreamOffset >= convertedOffset) {
            log.trace("latestDownstreamOffset {} is larger than convertedUpstreamOffset {} for "
                + "TopicPartition {}", latestDownstreamOffset, convertedOffset, topicPartition);
            continue;
        }
        offsetToSync.put(entry.getKey(), convertedUpstreamOffset.get(topicPartition));
    }

Proposed:

    for (Map.Entry<TopicPartition, OffsetAndMetadata> convertedEntry : convertedUpstreamOffset.entrySet()) {
        TopicPartition topicPartition = convertedEntry.getKey();
        for (Entry<TopicPartition, OffsetAndMetadata> idleEntry : group.getValue()) {
            if (idleEntry.getKey() == topicPartition) {
                long latestDownstreamOffset = idleEntry.getValue().offset();
                // if translated offset from upstream is smaller than the current consumer offset
                // in the target, skip updating the offset for that partition
                long convertedOffset = convertedUpstreamOffset.get(topicPartition).offset();
                if (latestDownstreamOffset >= convertedOffset) {
                    log.trace("latestDownstreamOffset {} is larger than convertedUpstreamOffset {} for "
                        + "TopicPartition {}", latestDownstreamOffset, convertedOffset, topicPartition);
                    continue;
                }
            }
        }
        offsetToSync.put(convertedEntry.getKey(), convertedUpstreamOffset.get(topicPartition));
    }
81b41bc to 35a7b1f
@ryannedolan @mimaison I added the integration tests for this automated consumer offset sync in MM 2.0. When available, I would appreciate your first pass of review. Thanks
mimaison
left a comment
Thanks for the updates. I've made another pass and left a few comments
can we move these imports with the other java.util imports?
can we change the type definition of these 2 to be Admin? Then we don't need the cast
It looks like we describe all groups just to get groups in the EMPTY state. Can we use the new listGroups() method introduced in KIP-518 to only get groups in that specific state?
On second thought, using describeConsumerGroups() may be more predictable in terms of work to do, as you describe only the groups assigned to this task
great to know about that KIP; then I will keep using describeConsumerGroups() here
updated to latestDownstreamOffset {} is larger than or equal to convertedUpstreamOffset {} for....
Can we use the existing constants for the config names?
We could use Collections.emptyMap() here and in a few places below
This looks unused, same below for c2t2p0
removed the unused ones
We can use new OffsetAndMetadata(50) if we don't set any metadata. same below
It's a bit unusual to have consumer3 and consumer4 without 1 and 2 =)
updated to consumer1 and consumer2
what about assertEquals("consumer record size is not zero", 0, records.count());? It can also be applied in a few other places
Hello @mimaison, thanks for your comments. I have addressed them in the latest push; please take another review. Thanks
bump for attention @mimaison ^ given that https://issues.apache.org/jira/browse/KAFKA-9076 has slipped to the next release (2.7.0) and some people may already be testing/using this feature, I hope it is possible to revisit this PR soon so that it can formally be part of Kafka. Thanks
ok to test
mimaison
left a comment
Thanks for the updates. I've taken another look and left a few more minor comments. I think I'd be happy to merge it once these are addressed
We should also close targetAdminClient
Can we use Admin instead of AdminClient for both of these?
Use Time.SYSTEM instead of creating a new instance
In order to make Kafka consumer and streams applications migrate from the source to the target cluster transparently and conveniently, e.g. in the event of source cluster failure, a background job is proposed to periodically sync the consumer offsets from the source to the target cluster, so that when the consumer and streams applications switch to the target cluster, they will resume consuming from where they left off at the source cluster.
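For reference, this behavior surfaced in KIP-545 as per-flow MM2 properties; a minimal illustrative config sketch (values shown are examples, not defaults, and the group patterns are placeholders):

```properties
# Periodically sync translated consumer group offsets from source to target (KIP-545).
source->target.sync.group.offsets.enabled = true
source->target.sync.group.offsets.interval.seconds = 60
# Only groups matching the whitelist (and not the blacklist) are synced.
source->target.groups = .*
```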
@mimaison thanks so much for your fast response :) I have addressed your above 3 comments; please take the final review pass
retest this please
unrelated test failures:
Some of the failures look related. Let's retest.
seems an unrelated test failed:
@mimaison if the one failed test is not relevant, are we ready to merge, or is there anything I can do? Thanks
Huge thanks to all reviewers and committers for providing valuable comments and testing results
* 'trunk' of github.com:apache/kafka:
* KAFKA-10180: Fix security_config caching in system tests (apache#8917)
* KAFKA-10173: Fix suppress changelog binary schema compatibility (apache#8905)
* KAFKA-10166: always write checkpoint before closing an (initialized) task (apache#8926)
* MINOR: Rename SslTransportLayer.State."NOT_INITALIZED" enum value to "NOT_INITIALIZED"
* MINOR: Update Scala to 2.13.3 (apache#8931)
* KAFKA-9076: support consumer sync across clusters in MM 2.0 (apache#7577)
* MINOR: Remove Diamond and code code Alignment (apache#8107)
* KAFKA-10198: guard against recycling dirty state (apache#8924)
https://cwiki.apache.org/confluence/display/KAFKA/KIP-545%3A+support+automated+consumer+offset+sync+across+clusters+in+MM+2.0