Skip to content

KAFKA-9232: Coordinator new member heartbeat completion does not work for JoinGroup v3#7753

Merged
hachikuji merged 8 commits intoapache:trunkfrom
ableegoldman:9232-coordinator-heartbeat-bug
Dec 23, 2019
Merged

KAFKA-9232: Coordinator new member heartbeat completion does not work for JoinGroup v3#7753
hachikuji merged 8 commits intoapache:trunkfrom
ableegoldman:9232-coordinator-heartbeat-bug

Conversation

@ableegoldman
Copy link
Copy Markdown
Member

The v3 JoinGroup logic does not properly complete the initial heartbeat for new members, which then expires after the static 5 minute timeout. The core issue is in the shouldKeepAlive method, which returns false when it should return true.

@ableegoldman
Copy link
Copy Markdown
Member Author

@hachikuji

def shouldKeepAlive(deadlineMs: Long): Boolean = {
if (isAwaitingJoin)
!isNew || latestHeartbeat + GroupCoordinator.NewMemberJoinTimeoutMs > deadlineMs
else awaitingSyncCallback != null ||
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaced awaitingSyncCallback != null with isAwaitingSync

if (isAwaitingJoin)
!isNew || latestHeartbeat + GroupCoordinator.NewMemberJoinTimeoutMs > deadlineMs
else awaitingSyncCallback != null ||
else if (isNew || isAwaitingSync)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In onCompleteJoin, we call

  1. group.maybeInvokeJoinCallback(member, joinResult)
  2. completeAndScheduleNextHeartbeatExpiration(group, member)
  3. member.isNew = false

In 1. we ultimately set member.isAwaitingJoin to false, so we never reach the first branch in this case.

I refactored this a bit to reflect what (I think) makes sense: we want to always return true in the special cases of a) awaiting Sync, and b) between awaiting the Join and awaiting the Sync, ie, when isNew = true -- otherwise, we fall back to the usual case, ie, just return whether we've seen a heartbeat within the deadline or not

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively we could leave this method alone and change the ordering in onCompleteJoin to

member.isNew = false
completeAndScheduleNextHeartbeatExpiration(group, member)
group.maybeInvokeJoinCallback(member, joinResult)

I think this also makes sense, and might be a bit more clear, but I'd want to verify that it's ok to move the heartbeat completion to before the group.maybeInvokeJoinCallback

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm for simplifying if possible. What I am thinking is the following:

if (isNew)
  // New members are expired after the static join timeout
  latestHeartbeat + GroupCoordinator.NewMemberJoinTimeoutMs > deadlineMs
else if (isAwaitingJoin || isAwaitingSync)
  // Don't remove members as long as they have a request in purgatory
  true
else
  // Otherwise check for session expiration
  latestHeartbeat + sessionTimeoutMs > deadlineMs

Then I think this works with the reordering you suggested above:

member.isNew = false
completeAndScheduleNextHeartbeatExpiration(group, member)
group.maybeInvokeJoinCallback(member, joinResult)

Upon rebalance completion, the first check will fail since we set isNew to false and the second check will succeed because we haven't cleared the join callback yet. What do you think?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok yeah I like that, the shouldKeepAlive logic is much easier to follow

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Nov 27, 2019

@ableegoldman do you know when we regressed?

@hachikuji
Copy link
Copy Markdown
Contributor

@ijuma I think it was introduced in #5962 which was in 2.1. The bug doesn't affect newer versions of the protocol.

@ableegoldman
Copy link
Copy Markdown
Member Author

@hachikuji updated the PR, let me know if we want to test this beyond just the usual unit/integration tests or if it's good as is

@hachikuji
Copy link
Copy Markdown
Contributor

@ableegoldman I'd be happy with unit tests.

!isNew || latestHeartbeat + GroupCoordinator.NewMemberJoinTimeoutMs > deadlineMs
else awaitingSyncCallback != null ||
if (isNew)
// New members are expired after the static join timeout
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we fix the indentation? i think braces are fine

Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good. Left some small suggestions on the test case.

@Test
def testCompleteNewMemberTimeoutHeartbeat(): Unit = {
// New member joins the group
var responseFuture = sendJoinGroup(groupId, JoinGroupRequest.UNKNOWN_MEMBER_ID, protocolType, protocols, None, DefaultSessionTimeout, DefaultRebalanceTimeout, false)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could be val

val joinResult = Await.result(responseFuture, Duration(DefaultRebalanceTimeout + 100, TimeUnit.MILLISECONDS))
val group = groupCoordinator.groupManager.getGroup(groupId).get
assertEquals(Errors.NONE, joinResult.error)
assertEquals(0, group.allMemberMetadata.count(_.isNew))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth asserting latestHeartbeat as well?

assertEquals(0, group.allMemberMetadata.count(_.isNew))

// Make sure the delayed heartbeat based on the new member timeout has been completed
assertEquals(1, groupCoordinator.heartbeatPurgatory.numDelayed)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we wanted to go a tad further, we could advance the clock by the session timeout and assert the member gets kicked out. That ensures that the heartbeat left behind has the right timeout.

Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the fix!

@hachikuji
Copy link
Copy Markdown
Contributor

retest this please

2 similar comments
@hachikuji
Copy link
Copy Markdown
Contributor

retest this please

@hachikuji
Copy link
Copy Markdown
Contributor

retest this please

@hachikuji hachikuji merged commit 6b6b36c into apache:trunk Dec 23, 2019
hachikuji pushed a commit that referenced this pull request Dec 24, 2019
…p v3 and below (#7753)

The v3 JoinGroup logic does not properly complete the initial heartbeat for new members, which then expires after the static 5 minute timeout if the member does not rejoin. The core issue is in the `shouldKeepAlive` method, which returns false when it should return true because of an inconsistent timeout check.

Reviewers: Jason Gustafson <jason@confluent.io>
hachikuji pushed a commit that referenced this pull request Dec 24, 2019
…p v3 and below (#7753)

The v3 JoinGroup logic does not properly complete the initial heartbeat for new members, which then expires after the static 5 minute timeout if the member does not rejoin. The core issue is in the `shouldKeepAlive` method, which returns false when it should return true because of an inconsistent timeout check.

Reviewers: Jason Gustafson <jason@confluent.io>
hachikuji pushed a commit that referenced this pull request Dec 24, 2019
…p v3 and below (#7753)

The v3 JoinGroup logic does not properly complete the initial heartbeat for new members, which then expires after the static 5 minute timeout if the member does not rejoin. The core issue is in the `shouldKeepAlive` method, which returns false when it should return true because of an inconsistent timeout check.

Reviewers: Jason Gustafson <jason@confluent.io>
hachikuji pushed a commit that referenced this pull request Dec 24, 2019
…p v3 and below (#7753)

The v3 JoinGroup logic does not properly complete the initial heartbeat for new members, which then expires after the static 5 minute timeout if the member does not rejoin. The core issue is in the `shouldKeepAlive` method, which returns false when it should return true because of an inconsistent timeout check.

Reviewers: Jason Gustafson <jason@confluent.io>
hachikuji pushed a commit to confluentinc/kafka that referenced this pull request Jan 8, 2020
…p v3 and below (apache#7753)

The v3 JoinGroup logic does not properly complete the initial heartbeat for new members, which then expires after the static 5 minute timeout if the member does not rejoin. The core issue is in the `shouldKeepAlive` method, which returns false when it should return true because of an inconsistent timeout check.

Reviewers: Jason Gustafson <jason@confluent.io>
qq619618919 pushed a commit to qq619618919/kafka that referenced this pull request May 12, 2020
…p v3 and below (apache#7753)

The v3 JoinGroup logic does not properly complete the initial heartbeat for new members, which then expires after the static 5 minute timeout if the member does not rejoin. The core issue is in the `shouldKeepAlive` method, which returns false when it should return true because of an inconsistent timeout check.

Reviewers: Jason Gustafson <jason@confluent.io>
@ableegoldman ableegoldman deleted the 9232-coordinator-heartbeat-bug branch June 26, 2020 22:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants