
KAFKA-13563: clear FindCoordinatorFuture for non consumer group mode#11631

Merged
guozhangwang merged 5 commits intoapache:trunkfrom
showuon:KAFKA-13563
Feb 6, 2022

Conversation

@showuon
Member

@showuon showuon commented Dec 30, 2021

jira: KAFKA-13563

After KAFKA-10793, we clear the findCoordinatorFuture in 2 places:

1. heartbeat thread
2. AbstractCoordinator#ensureCoordinatorReady

But in non-consumer-group mode with a group id provided (for offset commits, so a ConsumerCoordinator is created), there is no (1) heartbeat thread, and (2) AbstractCoordinator#ensureCoordinatorReady is only called the first time the consumer wants to fetch the committed offset position. That is, after the second lookupCoordinator call, we have no chance to clear the findCoordinatorFuture, so offset commits never succeed.

To avoid the race condition mentioned in KAFKA-10793, it's not safe to clear the findCoordinatorFuture in the future listener. So, I think we can fix this issue by calling AbstractCoordinator#ensureCoordinatorReady when the coordinator is unknown in the non-consumer-group case, on each ConsumerCoordinator#poll.
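The lifecycle described above can be modeled with a minimal sketch (plain-Java stand-ins that mirror the client's names, not the real Kafka classes): lookupCoordinator caches a future that only ensureCoordinatorReady clears, so a poll path that never calls ensureCoordinatorReady keeps returning the stale completed future forever.

```java
// Minimal model of the bug and fix. Class and method names mirror the real
// client for readability, but this is a simplified stand-in, not Kafka code.
import java.util.concurrent.CompletableFuture;

class CoordinatorModel {
    CompletableFuture<String> findCoordinatorFuture; // cleared only in ensureCoordinatorReady
    String coordinator;                              // null means "coordinator unknown"

    CompletableFuture<String> lookupCoordinator() {
        if (findCoordinatorFuture == null) {
            findCoordinatorFuture = new CompletableFuture<>();
        }
        return findCoordinatorFuture; // without the fix, a stale completed future is returned forever
    }

    // The fix: call this from every poll() while the coordinator is unknown,
    // so the cached future is always cleared once it completes.
    boolean ensureCoordinatorReady() {
        CompletableFuture<String> f = lookupCoordinator();
        if (f.isDone()) {
            coordinator = f.join();
            findCoordinatorFuture = null; // the clearing step the non-group poll path was missing
        }
        return coordinator != null;
    }

    public static void main(String[] args) {
        CoordinatorModel m = new CoordinatorModel();
        m.lookupCoordinator().complete("broker-0"); // the first FindCoordinator response arrives
        boolean ready = m.ensureCoordinatorReady(); // clears the future and records the coordinator
        System.out.println(ready + " " + (m.findCoordinatorFuture == null));
    }
}
```

Without the ensureCoordinatorReady call, findCoordinatorFuture would stay non-null after completion, which is the stuck state the PR describes.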

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@showuon showuon changed the title KAFKA-13563: ensure coordinator ready for each consumer poll, to hand… KAFKA-13563: clear FindCoordinatorFuture for non consumer group mode Dec 30, 2021
Comment on lines -520 to +526

- // For manually assigned partitions, if there are no ready nodes, await metadata.
+ // For manually assigned partitions, if coordinator is unknown, make sure we lookup one and await metadata.
  // If connections to all nodes fail, wakeups triggered while attempting to send fetch
  // requests result in polls returning immediately, causing a tight loop of polls. Without
  // the wakeup, poll() with no channels would block for the timeout, delaying re-connection.
- // awaitMetadataUpdate() initiates new connections with configured backoff and avoids the busy loop.
- // When group management is used, metadata wait is already performed for this scenario as
- // coordinator is unknown, hence this check is not required.
- if (metadata.updateRequested() && !client.hasReadyNodes(timer.currentTimeMs())) {
-     client.awaitMetadataUpdate(timer);
+ // awaitMetadataUpdate() in ensureCoordinatorReady initiates new connections with configured backoff and avoids the busy loop.
+ if (coordinatorUnknown() && !ensureCoordinatorReady(timer)) {
+     return false;
Member Author

Before the change, we waited for a metadata update if no nodes were available, to avoid a busy loop in non-consumer-group mode. After the change, we do what group management does: call ensureCoordinatorReady when the coordinator is unknown. This way, we also make sure the FindCoordinatorFuture is handled properly (and cleared) inside ensureCoordinatorReady.

Contributor

I'm wondering if this fix is a bit overkill, e.g. for those consumers who do not even set the group ID and do not rely on the coordinator for anything, the coordinatorUnknown() would always return true while ensureCoordinatorReady would send a discover coordinator request with null group id.

@showuon showuon changed the title KAFKA-13563: clear FindCoordinatorFuture for non consumer group mode [WIP] KAFKA-13563: clear FindCoordinatorFuture for non consumer group mode Dec 30, 2021
@showuon
Member Author

showuon commented Dec 30, 2021

@guozhangwang @ableegoldman , could you please take a look to see if this fix makes sense to you? I'll fix the broken tests after your first review. Thank you.

Contributor

@guozhangwang guozhangwang left a comment


Hi @showuon thanks for reporting this bug. I agree with you it's indeed an issue, but I'm not sure about the fix, maybe you can elaborate a bit more about the motivation?

@showuon
Member Author

showuon commented Jan 5, 2022

@guozhangwang , thanks for your comment. Answering your question below:

I'm wondering if this fix is a bit overkill, e.g. for those consumers who do not even set the group ID and do not rely on the coordinator for anything, the coordinatorUnknown() would always return true while ensureCoordinatorReady would send a discover coordinator request with null group id.

No. If the consumer doesn't provide a group id config value (the default is null), we won't create a ConsumerCoordinator in the consumer. That is, if the group id is provided, it's either for consumer group management (via Consumer#subscribe), or for manual assignment (via Consumer#assign) with offset commit enabled.

REF: https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java#L775
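The guard being referenced can be sketched like this (a plain-Java stand-in for the constructor logic in the linked KafkaConsumer code, not the actual implementation): no group id means no coordinator instance at all, so the concern about FindCoordinator requests with a null group id does not arise.

```java
// Simplified stand-in for the KafkaConsumer constructor guard (not Kafka code):
// a ConsumerCoordinator is only created when a group id is configured.
import java.util.Optional;

class ConsumerSketch {
    final Object coordinator; // stand-in for ConsumerCoordinator

    ConsumerSketch(Optional<String> groupId) {
        // No group id configured -> no coordinator, so the poll path
        // never issues FindCoordinator requests for such consumers.
        this.coordinator = groupId.isPresent() ? new Object() : null;
    }

    public static void main(String[] args) {
        System.out.println(new ConsumerSketch(Optional.empty()).coordinator == null);
        System.out.println(new ConsumerSketch(Optional.of("my-group")).coordinator != null);
    }
}
```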

And before this PR, we waited for a metadata update if no nodes were available, to avoid a busy loop in non-consumer-group mode.

if (metadata.updateRequested() && !client.hasReadyNodes(timer.currentTimeMs())) {
    client.awaitMetadataUpdate(timer);

After the change, we do what group management does: call ensureCoordinatorReady when the coordinator is unknown. This way, we can also make sure the FindCoordinatorFuture is handled properly (and cleared) inside ensureCoordinatorReady.

Does that make sense?

@showuon
Member Author

showuon commented Jan 15, 2022

@guozhangwang @ableegoldman , please take a look when available. Thanks.

@showuon
Member Author

showuon commented Feb 4, 2022

@guozhangwang @ableegoldman @hachikuji , please take a look when available. This issue blocks some users when upgrading Kafka-clients. I think this should be fixed soon.
Thank you.

@guozhangwang
Contributor

@showuon Sorry for being late on this -- I thought it was not ready since the title still had WIP in it. I will re-title and continue reviewing it.

@guozhangwang guozhangwang changed the title [WIP] KAFKA-13563: clear FindCoordinatorFuture for non consumer group mode KAFKA-13563: clear FindCoordinatorFuture for non consumer group mode Feb 4, 2022
Contributor

@guozhangwang guozhangwang left a comment


Hi @showuon thanks for the explanation! I now get it and agree with you about the fix.

One minor comment still though: the modified logic is exactly the same as

https://github.com/apache/kafka/pull/11631/files#diff-0029e982555d1fae10943b862924da962ca8e247a3070cded92c5f5a5960244fL483-L485

Could we move the whole block before the if-else block and also update the related comments?

Other than that, LGTM!

@showuon
Member Author

showuon commented Feb 4, 2022

@guozhangwang , thanks for your comments. Sorry for the confusion! It makes sense to remove the WIP from the PR title.
I'll let you know when I complete the PR. Thank you.

  ConsumerPartitionAssignor assignor = new RoundRobinAssignor();

- KafkaConsumer<String, String> consumer = newConsumer(time, client, subscription, metadata, assignor, true, groupInstanceId);
+ KafkaConsumer<String, String> consumer = newConsumer(time, client, subscription, metadata, assignor, true, null, groupInstanceId, false);
Member Author

This test case tests manual assignment without storing offsets, so we should not set a group id (the third parameter from the end).

test added for this issue: https://issues.apache.org/jira/browse/KAFKA-4034

- GroupRebalanceConfig rebalanceConfig = new GroupRebalanceConfig(sessionTimeoutMs,
+ ConsumerCoordinator consumerCoordinator = null;
+ if (groupId != null) {
+     GroupRebalanceConfig rebalanceConfig = new GroupRebalanceConfig(sessionTimeoutMs,
Member Author

We should not create a ConsumerCoordinator when groupId is null.

Contributor

Nice cleanup!

  // in this case, we do an explicit seek, so there should be no need to query the coordinator at all
- val consumer = createConsumer()
+ // remove the group.id config to avoid coordinator created
+ val consumer = createConsumer(configsToRemove = List(ConsumerConfig.GROUP_ID_CONFIG))
Member Author

Same as above.

This test case tests manual assignment without storing offsets, so we should not set a group id (the third parameter from the end).

test added for this issue: https://issues.apache.org/jira/browse/KAFKA-4034

Comment on lines +195 to +197
+ private final Time time = new MockTime();
+ private final SubscriptionState subscription = new SubscriptionState(new LogContext(), OffsetResetStrategy.EARLIEST);
+ private final ConsumerPartitionAssignor assignor = new RoundRobinAssignor();
Member Author

side cleanup: create global variables to share in tests.

Comment on lines 492 to +495

  // Always update the heartbeat last poll time so that the heartbeat thread does not leave the
  // group proactively due to application inactivity even if (say) the coordinator cannot be found.
  pollHeartbeat(timer.currentTimeMs());
- if (coordinatorUnknown() && !ensureCoordinatorReady(timer)) {
+ if (coordinatorUnknownAndUnready(timer)) {
Member Author

@showuon showuon Feb 6, 2022


Could we move the whole block before the if-else block and also update the related comments?

We can't, because we should always look up the coordinator after the heartbeat poll, to avoid heartbeat timeouts in the group management case.
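The ordering constraint can be demonstrated with a small model (plain-Java stand-ins, hypothetical names, not the real method bodies): recording heartbeat liveness before the potentially blocking coordinator lookup keeps the recorded poll time as fresh as possible.

```java
// Model of why pollHeartbeat must run BEFORE the coordinator lookup:
// the lookup may block for a long time, and the heartbeat thread judges
// application liveness by the last recorded poll time.
class PollOrderModel {
    long lastHeartbeatPollMs;
    long nowMs = 0;

    void pollHeartbeat(long ms) {
        lastHeartbeatPollMs = ms; // record liveness
    }

    boolean coordinatorUnknownAndUnready() {
        nowMs += 5_000; // the lookup may block up to the timer's deadline
        return false;   // assume the coordinator is eventually found
    }

    boolean poll() {
        pollHeartbeat(nowMs); // liveness recorded first
        if (coordinatorUnknownAndUnready()) return false;
        return true;
    }

    public static void main(String[] args) {
        PollOrderModel m = new PollOrderModel();
        m.poll();
        // The staleness observed by the heartbeat thread equals only the lookup
        // time; flipping the order would make the recorded poll time 5000ms
        // fresher-looking going in, but stale by the same amount coming out of
        // every subsequent poll, risking a proactive group leave.
        System.out.println(m.nowMs - m.lastHeartbeatPollMs);
    }
}
```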

Contributor

Makes sense.

@showuon
Member Author

showuon commented Feb 6, 2022

@guozhangwang , this PR is good for review now. I've fixed broken tests and added tests. Thank you.

Contributor

@guozhangwang guozhangwang left a comment


LGTM!

Comment on lines 492 to +495

  // Always update the heartbeat last poll time so that the heartbeat thread does not leave the
  // group proactively due to application inactivity even if (say) the coordinator cannot be found.
  pollHeartbeat(timer.currentTimeMs());
- if (coordinatorUnknown() && !ensureCoordinatorReady(timer)) {
+ if (coordinatorUnknownAndUnready(timer)) {
Contributor

Makes sense.

- GroupRebalanceConfig rebalanceConfig = new GroupRebalanceConfig(sessionTimeoutMs,
+ ConsumerCoordinator consumerCoordinator = null;
+ if (groupId != null) {
+     GroupRebalanceConfig rebalanceConfig = new GroupRebalanceConfig(sessionTimeoutMs,
Contributor

Nice cleanup!

@guozhangwang guozhangwang merged commit ca5d6f9 into apache:trunk Feb 6, 2022
guozhangwang pushed a commit that referenced this pull request Feb 6, 2022
…11631)

After KAFKA-10793, we clear the findCoordinatorFuture in 2 places:

1. heartbeat thread
2. AbstractCoordinator#ensureCoordinatorReady

But in non-consumer-group mode with a group id provided (for offset commits, so a ConsumerCoordinator is created), there is no (1) heartbeat thread, and (2) AbstractCoordinator#ensureCoordinatorReady is only called the first time the consumer wants to fetch the committed offset position. That is, after the second lookupCoordinator call, we have no chance to clear the findCoordinatorFuture, so offset commits never succeed.

To avoid the race condition mentioned in KAFKA-10793, it's not safe to clear the findCoordinatorFuture in the future listener. So, I think we can fix this issue by calling AbstractCoordinator#ensureCoordinatorReady when the coordinator is unknown in the non-consumer-group case, on each ConsumerCoordinator#poll.


Reviewers: Guozhang Wang <wangguoz@gmail.com>
@guozhangwang
Contributor

Merged to trunk, thanks @showuon ! Also cherry-picked to 3.1.

jeffkbkim pushed a commit to confluentinc/kafka that referenced this pull request May 12, 2022
…pache#11631)

After KAFKA-10793, we clear the findCoordinatorFuture in 2 places:

1. heartbeat thread
2. AbstractCoordinator#ensureCoordinatorReady

But in non-consumer-group mode with a group id provided (for offset commits, so a ConsumerCoordinator is created), there is no (1) heartbeat thread, and (2) AbstractCoordinator#ensureCoordinatorReady is only called the first time the consumer wants to fetch the committed offset position. That is, after the second lookupCoordinator call, we have no chance to clear the findCoordinatorFuture, so offset commits never succeed.

To avoid the race condition mentioned in KAFKA-10793, it's not safe to clear the findCoordinatorFuture in the future listener. So, I think we can fix this issue by calling AbstractCoordinator#ensureCoordinatorReady when the coordinator is unknown in the non-consumer-group case, on each ConsumerCoordinator#poll.


Reviewers: Guozhang Wang <wangguoz@gmail.com>
guozhangwang added a commit that referenced this pull request Jun 6, 2022
…2244)

This is another way of fixing KAFKA-13563 other than #11631.

Instead of letting the consumer always try to discover the coordinator in poll in either mode (subscribe / assign), we defer the clearance of the discover future to async commits only. More specifically, under manual assign mode, there are only three places where we need the coordinator:

* commitAsync (both by the consumer itself or triggered by the caller), which is what we want to fix.
* commitSync, where we already try to re-discover the coordinator.
* committed (both by the consumer itself based on reset policy, or triggered by the caller), where we already try to re-discover the coordinator.

The benefit is that under manual assign mode, if none of the above three is triggered, we never discover the coordinator at all. The original fix in #11631 would let the consumer discover the coordinator even if none of the above operations are required.

Reviewers: Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>
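The deferred-clearance approach can be modeled as a minimal sketch (plain-Java stand-ins with names mirroring the client, not the real Kafka code): poll does no coordinator work at all in manual-assign mode, and only the commit path clears the cached discovery future.

```java
// Model of the alternative fix: clearance of the cached FindCoordinator
// future is deferred to the async-commit path, so a manual-assign consumer
// that never commits never discovers a coordinator.
import java.util.concurrent.CompletableFuture;

class DeferredClearModel {
    CompletableFuture<String> findCoordinatorFuture;
    String coordinator;

    CompletableFuture<String> lookupCoordinator() {
        if (findCoordinatorFuture == null) findCoordinatorFuture = new CompletableFuture<>();
        return findCoordinatorFuture;
    }

    // poll() in manual-assign mode: fetch positions only, no coordinator traffic.
    void poll() { /* no FindCoordinator request issued here */ }

    // Only the commit path clears a completed future and records the coordinator.
    boolean commitAsync() {
        CompletableFuture<String> f = lookupCoordinator();
        if (f.isDone()) {
            coordinator = f.join();
            findCoordinatorFuture = null; // clearance deferred to this path
        }
        return coordinator != null;
    }

    public static void main(String[] args) {
        DeferredClearModel m = new DeferredClearModel();
        m.poll();
        System.out.println(m.findCoordinatorFuture == null); // poll alone triggers no discovery
        m.lookupCoordinator().complete("broker-1");          // discovery response arrives
        System.out.println(m.commitAsync());                 // commit clears the future and succeeds
    }
}
```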
ijuma pushed a commit that referenced this pull request Jun 7, 2022
…2244) (#12259)

This is a cherrypick commit of 3.1. Another way of fixing KAFKA-13563 other than #11631.

Instead of letting the consumer always try to discover the coordinator in poll in either mode (subscribe / assign), we defer the clearance of the discover future to async commits only. More specifically, under manual assign mode, there are only three places where we need the coordinator:

* commitAsync (both by the consumer itself or triggered by the caller), which is what we want to fix.
* commitSync, where we already try to re-discover the coordinator.
* committed (both by the consumer itself based on reset policy, or triggered by the caller), where we already try to re-discover the coordinator.

The benefit is that under manual assign mode, if none of the above three is triggered, we never discover the coordinator at all. The original fix in #11631 would let the consumer discover the coordinator even if none of the above operations are required.

Reviewers: Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>
philipnee pushed a commit to confluentinc/kafka that referenced this pull request Jun 7, 2022
…ache#12244) (apache#12259)

This is a cherrypick commit of 3.1. Another way of fixing KAFKA-13563 other than apache#11631.

Instead of letting the consumer always try to discover the coordinator in poll in either mode (subscribe / assign), we defer the clearance of the discover future to async commits only. More specifically, under manual assign mode, there are only three places where we need the coordinator:

* commitAsync (both by the consumer itself or triggered by the caller), which is what we want to fix.
* commitSync, where we already try to re-discover the coordinator.
* committed (both by the consumer itself based on reset policy, or triggered by the caller), where we already try to re-discover the coordinator.

The benefit is that under manual assign mode, if none of the above three is triggered, we never discover the coordinator at all. The original fix in apache#11631 would let the consumer discover the coordinator even if none of the above operations are required.

Reviewers: Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>
philipnee pushed a commit to confluentinc/kafka that referenced this pull request Jun 7, 2022
…ache#12244)

This is another way of fixing KAFKA-13563 other than apache#11631.

Instead of letting the consumer always try to discover the coordinator in poll in either mode (subscribe / assign), we defer the clearance of the discover future to async commits only. More specifically, under manual assign mode, there are only three places where we need the coordinator:

* commitAsync (both by the consumer itself or triggered by the caller), which is what we want to fix.
* commitSync, where we already try to re-discover the coordinator.
* committed (both by the consumer itself based on reset policy, or triggered by the caller), where we already try to re-discover the coordinator.

The benefit is that under manual assign mode, if none of the above three is triggered, we never discover the coordinator at all. The original fix in apache#11631 would let the consumer discover the coordinator even if none of the above operations are required.

Reviewers: Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>
philipnee added a commit to confluentinc/kafka that referenced this pull request Jun 7, 2022
* HOTFIX: only try to clear discover-coordinator future upon commit (apache#12244)

This is another way of fixing KAFKA-13563 other than apache#11631.

Instead of letting the consumer always try to discover the coordinator in poll in either mode (subscribe / assign), we defer the clearance of the discover future to async commits only. More specifically, under manual assign mode, there are only three places where we need the coordinator:

* commitAsync (both by the consumer itself or triggered by the caller), which is what we want to fix.
* commitSync, where we already try to re-discover the coordinator.
* committed (both by the consumer itself based on reset policy, or triggered by the caller), where we already try to re-discover the coordinator.

The benefit is that under manual assign mode, if none of the above three is triggered, we never discover the coordinator at all. The original fix in apache#11631 would let the consumer discover the coordinator even if none of the above operations are required.

Reviewers: Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>

* HOTFIX: add space to avoid checkstyle failure

Co-authored-by: Guozhang Wang <wangguoz@gmail.com>
philipnee added a commit to confluentinc/kafka that referenced this pull request Jun 8, 2022
…ache#12244) (apache#12259) (#723)

This is a cherrypick commit of 3.1. Another way of fixing KAFKA-13563 other than apache#11631.

Instead of letting the consumer always try to discover the coordinator in poll in either mode (subscribe / assign), we defer the clearance of the discover future to async commits only. More specifically, under manual assign mode, there are only three places where we need the coordinator:

* commitAsync (both by the consumer itself or triggered by the caller), which is what we want to fix.
* commitSync, where we already try to re-discover the coordinator.
* committed (both by the consumer itself based on reset policy, or triggered by the caller), where we already try to re-discover the coordinator.

The benefit is that under manual assign mode, if none of the above three is triggered, we never discover the coordinator at all. The original fix in apache#11631 would let the consumer discover the coordinator even if none of the above operations are required.

Reviewers: Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>

Co-authored-by: Guozhang Wang <wangguoz@gmail.com>
showuon pushed a commit that referenced this pull request Sep 13, 2022
…et commits (#12626)

Asynchronous offset commits may throw an unexpected WakeupException following #11631 and #12244. This patch fixes the problem by passing through a flag to ensureCoordinatorReady to indicate whether wakeups should be disabled. This is used to disable wakeups in the context of asynchronous offset commits. All other uses leave wakeups enabled.

Note: this patch builds on top of #12611.

Co-Authored-By: Guozhang Wang wangguoz@gmail.com

Reviewers: Luke Chen <showuon@gmail.com>
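The flag-threading this follow-up describes can be sketched as a small model (simplified stand-ins with hypothetical signatures, not the real client API): a caller-supplied flag decides whether a pending wakeup may interrupt coordinator discovery, and the async-commit path passes the flag that disables wakeups.

```java
// Model of the #12626 follow-up: ensureCoordinatorReady takes a flag so the
// async-commit path can opt out of wakeups, while all other callers keep them.
class WakeupModel {
    boolean wokenUp; // a wakeup is pending (consumer.wakeup() was called)

    boolean ensureCoordinatorReady(boolean disableWakeups) {
        if (!disableWakeups && wokenUp) {
            // stand-in for Kafka's WakeupException
            throw new RuntimeException("WakeupException");
        }
        return true; // pretend the coordinator was found
    }

    // Async commits must never surface a WakeupException from discovery.
    boolean commitAsync() {
        return ensureCoordinatorReady(true); // wakeups disabled on this path only
    }

    public static void main(String[] args) {
        WakeupModel m = new WakeupModel();
        m.wokenUp = true;                    // a wakeup is pending
        System.out.println(m.commitAsync()); // still succeeds: wakeups disabled here
        try {
            m.ensureCoordinatorReady(false); // every other caller keeps wakeups enabled
            System.out.println("no-throw");
        } catch (RuntimeException e) {
            System.out.println("threw");
        }
    }
}
```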