
KAFKA-13563: clear FindCoordinatorFuture for non consumer group mode#11631

Merged
guozhangwang merged 5 commits intoapache:trunkfrom
showuon:KAFKA-13563
Feb 6, 2022

Conversation

@showuon
Member

@showuon showuon commented Dec 30, 2021

jira: KAFKA-13563

After KAFKA-10793, we clear the findCoordinatorFuture in 2 places:

1. heartbeat thread
2. AbstractCoordinator#ensureCoordinatorReady

But in non-consumer-group mode with a group id provided (for offset commits, so a ConsumerCoordinator is created), there is no (1) heartbeat thread, and (2) AbstractCoordinator#ensureCoordinatorReady is only called the first time the consumer wants to fetch the committed offset position. That is, after the second lookupCoordinator call, we have no chance to clear the findCoordinatorFuture, so offset commits never succeed.

To avoid the race condition mentioned in KAFKA-10793, it's not safe to clear the findCoordinatorFuture in the future listener. So, I think we can fix this issue by calling AbstractCoordinator#ensureCoordinatorReady when the coordinator is unknown in the non-consumer-group case, on each ConsumerCoordinator#poll.
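The lifecycle described above can be modeled with a minimal sketch (plain-Java stand-ins that mirror the client's names, not the real Kafka classes): lookupCoordinator caches a future that only ensureCoordinatorReady clears, so a poll path that never calls ensureCoordinatorReady keeps returning the stale completed future forever.

```java
// Minimal model of the bug and fix. Class and method names mirror the real
// client for readability, but this is a simplified stand-in, not Kafka code.
import java.util.concurrent.CompletableFuture;

class CoordinatorModel {
    CompletableFuture<String> findCoordinatorFuture; // cleared only in ensureCoordinatorReady
    String coordinator;                              // null means "coordinator unknown"

    CompletableFuture<String> lookupCoordinator() {
        if (findCoordinatorFuture == null) {
            findCoordinatorFuture = new CompletableFuture<>();
        }
        return findCoordinatorFuture; // without the fix, a stale completed future is returned forever
    }

    // The fix: call this from every poll() while the coordinator is unknown,
    // so the cached future is always cleared once it completes.
    boolean ensureCoordinatorReady() {
        CompletableFuture<String> f = lookupCoordinator();
        if (f.isDone()) {
            coordinator = f.join();
            findCoordinatorFuture = null; // the clearing step the non-group poll path was missing
        }
        return coordinator != null;
    }

    public static void main(String[] args) {
        CoordinatorModel m = new CoordinatorModel();
        m.lookupCoordinator().complete("broker-0"); // the first FindCoordinator response arrives
        boolean ready = m.ensureCoordinatorReady(); // clears the future and records the coordinator
        System.out.println(ready + " " + (m.findCoordinatorFuture == null));
    }
}
```

Without the ensureCoordinatorReady call, findCoordinatorFuture would stay non-null after completion, which is the stuck state the PR describes.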

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@showuon showuon changed the title KAFKA-13563: ensure coordinator ready for each consumer poll, to hand… KAFKA-13563: clear FindCoordinatorFuture for non consumer group mode Dec 30, 2021
Comment on lines -520 to +526

- // For manually assigned partitions, if there are no ready nodes, await metadata.
+ // For manually assigned partitions, if coordinator is unknown, make sure we lookup one and await metadata.
  // If connections to all nodes fail, wakeups triggered while attempting to send fetch
  // requests result in polls returning immediately, causing a tight loop of polls. Without
  // the wakeup, poll() with no channels would block for the timeout, delaying re-connection.
- // awaitMetadataUpdate() initiates new connections with configured backoff and avoids the busy loop.
- // When group management is used, metadata wait is already performed for this scenario as
- // coordinator is unknown, hence this check is not required.
- if (metadata.updateRequested() && !client.hasReadyNodes(timer.currentTimeMs())) {
-     client.awaitMetadataUpdate(timer);
+ // awaitMetadataUpdate() in ensureCoordinatorReady initiates new connections with configured backoff and avoids the busy loop.
+ if (coordinatorUnknown() && !ensureCoordinatorReady(timer)) {
+     return false;
Member Author

Before the change, we waited for a metadata update if no nodes were available, to avoid a busy loop in non-consumer-group mode. After the change, we do what group management does: call ensureCoordinatorReady when the coordinator is unknown. This way, we also make sure the FindCoordinatorFuture is handled properly (and cleared) inside ensureCoordinatorReady.

Contributor

I'm wondering if this fix is a bit overkill, e.g. for those consumers who do not even set the group ID and do not rely on the coordinator for anything, the coordinatorUnknown() would always return true while ensureCoordinatorReady would send a discover coordinator request with null group id.

@showuon showuon changed the title KAFKA-13563: clear FindCoordinatorFuture for non consumer group mode [WIP] KAFKA-13563: clear FindCoordinatorFuture for non consumer group mode Dec 30, 2021
@showuon
Member Author

showuon commented Dec 30, 2021

@guozhangwang @ableegoldman , could you please take a look to see if this fix makes sense to you? I'll fix the broken tests after your first review. Thank you.

Contributor

@guozhangwang guozhangwang left a comment


Hi @showuon thanks for reporting this bug. I agree with you it's indeed an issue, but I'm not sure about the fix, maybe you can elaborate a bit more about the motivation?

@showuon
Member Author

showuon commented Jan 5, 2022

@guozhangwang , thanks for your comment. Answering your question below:

I'm wondering if this fix is a bit overkill, e.g. for those consumers who do not even set the group ID and do not rely on the coordinator for anything, the coordinatorUnknown() would always return true while ensureCoordinatorReady would send a discover coordinator request with null group id.

No. If the consumer doesn't provide a group id config value (the default is null), we won't create a ConsumerCoordinator in the consumer. That is, if the group id is provided, it's either for consumer group management (via Consumer#subscribe), or for manual assignment (via Consumer#assign) with offset commit enabled.

REF: https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java#L775
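The guard being referenced can be sketched like this (a plain-Java stand-in for the constructor logic in the linked KafkaConsumer code, not the actual implementation): no group id means no coordinator instance at all, so the concern about FindCoordinator requests with a null group id does not arise.

```java
// Simplified stand-in for the KafkaConsumer constructor guard (not Kafka code):
// a ConsumerCoordinator is only created when a group id is configured.
import java.util.Optional;

class ConsumerSketch {
    final Object coordinator; // stand-in for ConsumerCoordinator

    ConsumerSketch(Optional<String> groupId) {
        // No group id configured -> no coordinator, so the poll path
        // never issues FindCoordinator requests for such consumers.
        this.coordinator = groupId.isPresent() ? new Object() : null;
    }

    public static void main(String[] args) {
        System.out.println(new ConsumerSketch(Optional.empty()).coordinator == null);
        System.out.println(new ConsumerSketch(Optional.of("my-group")).coordinator != null);
    }
}
```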

And before this PR, we waited for a metadata update if no nodes were available, to avoid a busy loop in non-consumer-group mode.

if (metadata.updateRequested() && !client.hasReadyNodes(timer.currentTimeMs())) {
    client.awaitMetadataUpdate(timer);

After the change, we do what group management does: call ensureCoordinatorReady when the coordinator is unknown. This way, we can also make sure the FindCoordinatorFuture is handled properly (and cleared) inside ensureCoordinatorReady.

Does that make sense?

@showuon
Member Author

showuon commented Jan 15, 2022

@guozhangwang @ableegoldman , please take a look when available. Thanks.

@showuon
Member Author

showuon commented Feb 4, 2022

@guozhangwang @ableegoldman @hachikuji , please take a look when available. This issue blocks some users when upgrading Kafka-clients. I think this should be fixed soon.
Thank you.

@guozhangwang
Contributor

@showuon Sorry for being late on this -- I thought it was not ready since the title still had WIP in it. I will re-title and continue reviewing it.

@guozhangwang guozhangwang changed the title [WIP] KAFKA-13563: clear FindCoordinatorFuture for non consumer group mode KAFKA-13563: clear FindCoordinatorFuture for non consumer group mode Feb 4, 2022
Contributor

@guozhangwang guozhangwang left a comment


Hi @showuon thanks for the explanation! I now get it and agree with you about the fix.

One minor comment still though: the modified logic is exactly the same as

https://github.com/apache/kafka/pull/11631/files#diff-0029e982555d1fae10943b862924da962ca8e247a3070cded92c5f5a5960244fL483-L485

Could we move the whole block before the if-else block and also update the related comments?

Other than that, LGTM!

@showuon
Member Author

showuon commented Feb 4, 2022

@guozhangwang , thanks for your comments. Sorry for the confusion! It makes sense to remove the WIP from the PR title.
I'll let you know when I complete the PR. Thank you.

  ConsumerPartitionAssignor assignor = new RoundRobinAssignor();

- KafkaConsumer<String, String> consumer = newConsumer(time, client, subscription, metadata, assignor, true, groupInstanceId);
+ KafkaConsumer<String, String> consumer = newConsumer(time, client, subscription, metadata, assignor, true, null, groupInstanceId, false);
Member Author

This test case tests manual assignment without storing offsets, so we should not set a group id (the third parameter from the end).

test added for this issue: https://issues.apache.org/jira/browse/KAFKA-4034

- GroupRebalanceConfig rebalanceConfig = new GroupRebalanceConfig(sessionTimeoutMs,
+ ConsumerCoordinator consumerCoordinator = null;
+ if (groupId != null) {
+     GroupRebalanceConfig rebalanceConfig = new GroupRebalanceConfig(sessionTimeoutMs,
Member Author

We should not create a ConsumerCoordinator when groupId is null.

Contributor

Nice cleanup!

  // in this case, we do an explicit seek, so there should be no need to query the coordinator at all
- val consumer = createConsumer()
+ // remove the group.id config to avoid coordinator created
+ val consumer = createConsumer(configsToRemove = List(ConsumerConfig.GROUP_ID_CONFIG))
Member Author

Same as above.

This test case tests manual assignment without storing offsets, so we should not set a group id (the third parameter from the end).

test added for this issue: https://issues.apache.org/jira/browse/KAFKA-4034

Comment on lines +195 to +197
+ private final Time time = new MockTime();
+ private final SubscriptionState subscription = new SubscriptionState(new LogContext(), OffsetResetStrategy.EARLIEST);
+ private final ConsumerPartitionAssignor assignor = new RoundRobinAssignor();
Member Author

side cleanup: create global variables to share in tests.

Comment on lines 492 to +495

  // Always update the heartbeat last poll time so that the heartbeat thread does not leave the
  // group proactively due to application inactivity even if (say) the coordinator cannot be found.
  pollHeartbeat(timer.currentTimeMs());
- if (coordinatorUnknown() && !ensureCoordinatorReady(timer)) {
+ if (coordinatorUnknownAndUnready(timer)) {
Member Author

@showuon showuon Feb 6, 2022


Could we move the whole block before the if-else block and also update the related comments?

We can't, because we should always look up the coordinator after the heartbeat poll, to avoid heartbeat timeouts in the group management case.
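The ordering constraint can be demonstrated with a small model (plain-Java stand-ins, hypothetical names, not the real method bodies): recording heartbeat liveness before the potentially blocking coordinator lookup keeps the recorded poll time as fresh as possible.

```java
// Model of why pollHeartbeat must run BEFORE the coordinator lookup:
// the lookup may block for a long time, and the heartbeat thread judges
// application liveness by the last recorded poll time.
class PollOrderModel {
    long lastHeartbeatPollMs;
    long nowMs = 0;

    void pollHeartbeat(long ms) {
        lastHeartbeatPollMs = ms; // record liveness
    }

    boolean coordinatorUnknownAndUnready() {
        nowMs += 5_000; // the lookup may block up to the timer's deadline
        return false;   // assume the coordinator is eventually found
    }

    boolean poll() {
        pollHeartbeat(nowMs); // liveness recorded first
        if (coordinatorUnknownAndUnready()) return false;
        return true;
    }

    public static void main(String[] args) {
        PollOrderModel m = new PollOrderModel();
        m.poll();
        // The staleness observed by the heartbeat thread equals only the lookup
        // time; flipping the order would make the recorded poll time 5000ms
        // fresher-looking going in, but stale by the same amount coming out of
        // every subsequent poll, risking a proactive group leave.
        System.out.println(m.nowMs - m.lastHeartbeatPollMs);
    }
}
```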

Contributor

Makes sense.

@showuon
Member Author

showuon commented Feb 6, 2022

@guozhangwang , this PR is good for review now. I've fixed broken tests and added tests. Thank you.

Contributor

@guozhangwang guozhangwang left a comment


LGTM!

Comment on lines 492 to +495

  // Always update the heartbeat last poll time so that the heartbeat thread does not leave the
  // group proactively due to application inactivity even if (say) the coordinator cannot be found.
  pollHeartbeat(timer.currentTimeMs());
- if (coordinatorUnknown() && !ensureCoordinatorReady(timer)) {
+ if (coordinatorUnknownAndUnready(timer)) {
Contributor

Makes sense.

- GroupRebalanceConfig rebalanceConfig = new GroupRebalanceConfig(sessionTimeoutMs,
+ ConsumerCoordinator consumerCoordinator = null;
+ if (groupId != null) {
+     GroupRebalanceConfig rebalanceConfig = new GroupRebalanceConfig(sessionTimeoutMs,
Contributor

Nice cleanup!

@guozhangwang guozhangwang merged commit ca5d6f9 into apache:trunk Feb 6, 2022
guozhangwang pushed a commit that referenced this pull request Feb 6, 2022
…11631)

After KAFKA-10793, we clear the findCoordinatorFuture in 2 places:

1. heartbeat thread
2. AbstractCoordinator#ensureCoordinatorReady

But in non-consumer-group mode with a group id provided (for offset commits, so a ConsumerCoordinator is created), there is no (1) heartbeat thread, and (2) AbstractCoordinator#ensureCoordinatorReady is only called the first time the consumer wants to fetch the committed offset position. That is, after the second lookupCoordinator call, we have no chance to clear the findCoordinatorFuture, so offset commits never succeed.

To avoid the race condition mentioned in KAFKA-10793, it's not safe to clear the findCoordinatorFuture in the future listener. So, I think we can fix this issue by calling AbstractCoordinator#ensureCoordinatorReady when the coordinator is unknown in the non-consumer-group case, on each ConsumerCoordinator#poll.


Reviewers: Guozhang Wang <wangguoz@gmail.com>
@guozhangwang
Contributor

Merged to trunk, thanks @showuon ! Also cherry-picked to 3.1.

jeffkbkim pushed a commit to confluentinc/kafka that referenced this pull request May 12, 2022
…pache#11631)

After KAFKA-10793, we clear the findCoordinatorFuture in 2 places:

1. heartbeat thread
2. AbstractCoordinator#ensureCoordinatorReady

But in non-consumer-group mode with a group id provided (for offset commits, so a ConsumerCoordinator is created), there is no (1) heartbeat thread, and (2) AbstractCoordinator#ensureCoordinatorReady is only called the first time the consumer wants to fetch the committed offset position. That is, after the second lookupCoordinator call, we have no chance to clear the findCoordinatorFuture, so offset commits never succeed.

To avoid the race condition mentioned in KAFKA-10793, it's not safe to clear the findCoordinatorFuture in the future listener. So, I think we can fix this issue by calling AbstractCoordinator#ensureCoordinatorReady when the coordinator is unknown in the non-consumer-group case, on each ConsumerCoordinator#poll.


Reviewers: Guozhang Wang <wangguoz@gmail.com>
guozhangwang added a commit that referenced this pull request Jun 6, 2022
…2244)

This is another way of fixing KAFKA-13563 other than #11631.

Instead of letting the consumer always try to discover the coordinator in poll in either mode (subscribe / assign), we defer the clearance of the discover future to async commits only. More specifically, under manual assign mode, there are only three places where we need the coordinator:

* commitAsync (both by the consumer itself or triggered by the caller), which is what we want to fix.
* commitSync, where we already try to re-discover the coordinator.
* committed (both by the consumer itself based on reset policy, or triggered by the caller), where we already try to re-discover the coordinator.

The benefit is that under manual assign mode, if none of the above three is triggered, we never discover the coordinator at all. The original fix in #11631 would let the consumer discover the coordinator even if none of the above operations are required.

Reviewers: Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>
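The deferred-clearance approach can be modeled as a minimal sketch (plain-Java stand-ins with names mirroring the client, not the real Kafka code): poll does no coordinator work at all in manual-assign mode, and only the commit path clears the cached discovery future.

```java
// Model of the alternative fix: clearance of the cached FindCoordinator
// future is deferred to the async-commit path, so a manual-assign consumer
// that never commits never discovers a coordinator.
import java.util.concurrent.CompletableFuture;

class DeferredClearModel {
    CompletableFuture<String> findCoordinatorFuture;
    String coordinator;

    CompletableFuture<String> lookupCoordinator() {
        if (findCoordinatorFuture == null) findCoordinatorFuture = new CompletableFuture<>();
        return findCoordinatorFuture;
    }

    // poll() in manual-assign mode: fetch positions only, no coordinator traffic.
    void poll() { /* no FindCoordinator request issued here */ }

    // Only the commit path clears a completed future and records the coordinator.
    boolean commitAsync() {
        CompletableFuture<String> f = lookupCoordinator();
        if (f.isDone()) {
            coordinator = f.join();
            findCoordinatorFuture = null; // clearance deferred to this path
        }
        return coordinator != null;
    }

    public static void main(String[] args) {
        DeferredClearModel m = new DeferredClearModel();
        m.poll();
        System.out.println(m.findCoordinatorFuture == null); // poll alone triggers no discovery
        m.lookupCoordinator().complete("broker-1");          // discovery response arrives
        System.out.println(m.commitAsync());                 // commit clears the future and succeeds
    }
}
```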
ijuma pushed a commit that referenced this pull request Jun 7, 2022
…2244) (#12259)

This is a cherrypick commit of 3.1. Another way of fixing KAFKA-13563 other than #11631.

Instead of letting the consumer always try to discover the coordinator in poll in either mode (subscribe / assign), we defer the clearance of the discover future to async commits only. More specifically, under manual assign mode, there are only three places where we need the coordinator:

* commitAsync (both by the consumer itself or triggered by the caller), which is what we want to fix.
* commitSync, where we already try to re-discover the coordinator.
* committed (both by the consumer itself based on reset policy, or triggered by the caller), where we already try to re-discover the coordinator.

The benefit is that under manual assign mode, if none of the above three is triggered, we never discover the coordinator at all. The original fix in #11631 would let the consumer discover the coordinator even if none of the above operations are required.

Reviewers: Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>
philipnee pushed a commit to confluentinc/kafka that referenced this pull request Jun 7, 2022
…ache#12244) (apache#12259)

This is a cherrypick commit of 3.1. Another way of fixing KAFKA-13563 other than apache#11631.

Instead of letting the consumer always try to discover the coordinator in poll in either mode (subscribe / assign), we defer the clearance of the discover future to async commits only. More specifically, under manual assign mode, there are only three places where we need the coordinator:

* commitAsync (both by the consumer itself or triggered by the caller), which is what we want to fix.
* commitSync, where we already try to re-discover the coordinator.
* committed (both by the consumer itself based on reset policy, or triggered by the caller), where we already try to re-discover the coordinator.

The benefit is that under manual assign mode, if none of the above three is triggered, we never discover the coordinator at all. The original fix in apache#11631 would let the consumer discover the coordinator even if none of the above operations are required.

Reviewers: Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>
philipnee pushed a commit to confluentinc/kafka that referenced this pull request Jun 7, 2022
…ache#12244)

This is another way of fixing KAFKA-13563 other than apache#11631.

Instead of letting the consumer always try to discover the coordinator in poll in either mode (subscribe / assign), we defer the clearance of the discover future to async commits only. More specifically, under manual assign mode, there are only three places where we need the coordinator:

* commitAsync (both by the consumer itself or triggered by the caller), which is what we want to fix.
* commitSync, where we already try to re-discover the coordinator.
* committed (both by the consumer itself based on reset policy, or triggered by the caller), where we already try to re-discover the coordinator.

The benefit is that under manual assign mode, if none of the above three is triggered, we never discover the coordinator at all. The original fix in apache#11631 would let the consumer discover the coordinator even if none of the above operations are required.

Reviewers: Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>
philipnee added a commit to confluentinc/kafka that referenced this pull request Jun 7, 2022
* HOTFIX: only try to clear discover-coordinator future upon commit (apache#12244)

This is another way of fixing KAFKA-13563 other than apache#11631.

Instead of letting the consumer always try to discover the coordinator in poll in either mode (subscribe / assign), we defer the clearance of the discover future to async commits only. More specifically, under manual assign mode, there are only three places where we need the coordinator:

* commitAsync (both by the consumer itself or triggered by the caller), which is what we want to fix.
* commitSync, where we already try to re-discover the coordinator.
* committed (both by the consumer itself based on reset policy, or triggered by the caller), where we already try to re-discover the coordinator.

The benefit is that under manual assign mode, if none of the above three is triggered, we never discover the coordinator at all. The original fix in apache#11631 would let the consumer discover the coordinator even if none of the above operations are required.

Reviewers: Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>

* HOTFIX: add space to avoid checkstyle failure

Co-authored-by: Guozhang Wang <wangguoz@gmail.com>
philipnee added a commit to confluentinc/kafka that referenced this pull request Jun 8, 2022
…ache#12244) (apache#12259) (#723)

This is a cherrypick commit of 3.1. Another way of fixing KAFKA-13563 other than apache#11631.

Instead of letting the consumer always try to discover the coordinator in poll in either mode (subscribe / assign), we defer the clearance of the discover future to async commits only. More specifically, under manual assign mode, there are only three places where we need the coordinator:

* commitAsync (both by the consumer itself or triggered by the caller), which is what we want to fix.
* commitSync, where we already try to re-discover the coordinator.
* committed (both by the consumer itself based on reset policy, or triggered by the caller), where we already try to re-discover the coordinator.

The benefit is that under manual assign mode, if none of the above three is triggered, we never discover the coordinator at all. The original fix in apache#11631 would let the consumer discover the coordinator even if none of the above operations are required.

Reviewers: Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>

Co-authored-by: Guozhang Wang <wangguoz@gmail.com>
showuon pushed a commit that referenced this pull request Sep 13, 2022
…et commits (#12626)

Asynchronous offset commits may throw an unexpected WakeupException following #11631 and #12244. This patch fixes the problem by passing through a flag to ensureCoordinatorReady to indicate whether wakeups should be disabled. This is used to disable wakeups in the context of asynchronous offset commits. All other uses leave wakeups enabled.

Note: this patch builds on top of #12611.

Co-Authored-By: Guozhang Wang wangguoz@gmail.com

Reviewers: Luke Chen <showuon@gmail.com>
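The flag-threading this follow-up describes can be sketched as a small model (simplified stand-ins with hypothetical signatures, not the real client API): a caller-supplied flag decides whether a pending wakeup may interrupt coordinator discovery, and the async-commit path passes the flag that disables wakeups.

```java
// Model of the #12626 follow-up: ensureCoordinatorReady takes a flag so the
// async-commit path can opt out of wakeups, while all other callers keep them.
class WakeupModel {
    boolean wokenUp; // a wakeup is pending (consumer.wakeup() was called)

    boolean ensureCoordinatorReady(boolean disableWakeups) {
        if (!disableWakeups && wokenUp) {
            // stand-in for Kafka's WakeupException
            throw new RuntimeException("WakeupException");
        }
        return true; // pretend the coordinator was found
    }

    // Async commits must never surface a WakeupException from discovery.
    boolean commitAsync() {
        return ensureCoordinatorReady(true); // wakeups disabled on this path only
    }

    public static void main(String[] args) {
        WakeupModel m = new WakeupModel();
        m.wokenUp = true;                    // a wakeup is pending
        System.out.println(m.commitAsync()); // still succeeds: wakeups disabled here
        try {
            m.ensureCoordinatorReady(false); // every other caller keeps wakeups enabled
            System.out.println("no-throw");
        } catch (RuntimeException e) {
            System.out.println("threw");
        }
    }
}
```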