Skip to content

KAFKA-8649: send latest commonly supported version in assignment#7423

Merged
guozhangwang merged 9 commits intoapache:trunkfrom
ableegoldman:VP-lie
Oct 2, 2019
Merged

KAFKA-8649: send latest commonly supported version in assignment#7423
guozhangwang merged 9 commits intoapache:trunkfrom
ableegoldman:VP-lie

Conversation

@ableegoldman
Copy link
Copy Markdown
Member

@ableegoldman ableegoldman commented Sep 30, 2019

...instead of sending the leader's version and having older members try to blindly upgrade.

The only other real change here is that we will also set the VERSION_PROBING error code and return early from onAssignment when we are upgrading our used subscription version (not just downgrading it) since this implies the whole group has finished the rolling upgrade and all members should rejoin with the new subscription version.

Also piggy-backing on a fix for a potentially dangerous edge case, where every thread of an instance is assigned the same set of active tasks.

Should be cherry-picked back to 2.1
2.4/trunk: this PR
2.3: PR #7425
2.2: PR #7426
2.1: PR #7427

@ableegoldman
Copy link
Copy Markdown
Member Author

Not sure if we need any additional changes to StreamsUpgradeTest but I kicked off the version probing system test: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/3055/

At some point we definitely do want to make some adjustments to StreamsUpgradeTest so it can catch bugs like this, but that doesn't need to block the bugfix.

Copy link
Copy Markdown
Contributor

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks great to me, and I like the cleanup a lot! Just some minor comments.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it seems all constructors would trigger the version check anyways, so shall we just move this in to the last constructor (with 6 params)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this cleanup.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe rename the field in AssignmentInfo as minSupportedVersion -- or commonlySupportedVersion is also fine -- and also rename the access function names as well as the local fields here? I felt latestSupportedVersion is becoming confusing now.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I kept flip-flopping on what I thought a good name would be -- happy to just go with commonlySupportedVersion

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If receivedAssignmentMetadataVersion < EARLIEST_PROBEABLE_VERSION, it is also a pretty bad situation since it means that the leader is too old to even be covered with version probing --- in fact, if it is that old, upon getting the new version it would throw exception and die immediately, so we'd not even see the assignment back at all --- so we'd better also log it as an ERROR to complete the logic.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, we should actually check for this (in validateMetadataVersions) and throw an exception -- it basically means someone tried to do a rolling bounce from < 2.0 without setting the UPGRADE_FROM config

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SG.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this would not be an error but possibly just an upgrade from a very old version -- see #7436

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest adding some javadoc here explaining our logic:

latestCommonlySupportedVersion belongs to [usedSubscriptionMetadataVersion, LATEST_SUPPORTED_VERSION]

receivedAssignmentMetadataVersion belongs to [EARLIEST_PROBEABLE_VERSION, usedSubscriptionMetadataVersion]

usedSubscriptionMetadataVersion would be upgraded to latestCommonlySupportedVersion when necessary.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can collapse these two into a single private function.

@guozhangwang
Copy link
Copy Markdown
Contributor

:streams:checkstyleTest failed.

fix system test (hopefully...)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message didn't make sense, it should be referring to upgrading the metadata to the newer version

set leader counter
@ableegoldman ableegoldman force-pushed the VP-lie branch 2 times, most recently from 72f10aa to 6b7ea3c Compare October 1, 2019 04:09
monitors[first_other_processor] = first_other_monitor
monitors[second_other_processor] = second_other_monitor

leader_monitor.wait_until("Received a future (version probing) subscription (version: 6). Sending empty assignment back (with supported version 5).",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +1 on removing this logic, we could not assume that the leader did not change even if it is not the one who get's bounced, since it may join the group late.

@guozhangwang
Copy link
Copy Markdown
Contributor

Failed JDK 11 and Scala 2.12, JDK 8 and Scala 2.11 on different tests (one for each run), no overlapping on the fail test.

@guozhangwang
Copy link
Copy Markdown
Contributor

The PR LGTM, merging to trunk.

@guozhangwang guozhangwang merged commit 8da6993 into apache:trunk Oct 2, 2019
guozhangwang pushed a commit that referenced this pull request Oct 4, 2019
Small follow-up to trunk PR #7423

While debugging the 2.3 VP PR we realized we should remove the leader-tracking from the VP system test altogether. We'd already merged the corresponding trunk PR so I made a quick new PR for trunk (also fixes a missed version bump in one of the log messages)

Reviewers: Guozhang Wang <wangguoz@gmail.com>
@ableegoldman ableegoldman deleted the VP-lie branch June 26, 2020 22:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants