[LI-HOTFIX] Rollback hotfix to pick up fix for KAFKA-9212 and KAFKA-9261. #63

Merged
xiowu0 merged 3 commits into linkedin:2.3-li from efeg:fix/cherryPickPR7805
Dec 12, 2019


@efeg efeg commented Dec 12, 2019

[LI-HOTFIX] Rollback hotfix to pick up fix for KAFKA-9212 and KAFKA-9261. (#63)

TICKET =[KAFKA-9212, KAFKA-9261]
LI_DESCRIPTION =

1. Rollback hotfix for "Rollback KAFKA-7440 as a workaround for KAFKA-9212"

2. KAFKA-9261; Client should handle inconsistent leader metadata (#7772)

This is a reduced-scope fix for KAFKA-9261. The purpose of this patch is to ensure that
partition leader state is kept in sync with broker metadata in MetadataCache and,
consequently, in Cluster. Because metadata events can be reordered, this state could
become inconsistent, which could lead to an NPE in some cases. The test case here
provides a specific scenario in which this can happen.

Also see #7770 for additional detail.

Reviewers: Ismael Juma <ismael@juma.me.uk>
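
The invariant described in item 2 can be illustrated with a minimal sketch. It is not the code from apache#7772; the class `LeaderConsistencySketch` and the method `reconcileLeaders` are hypothetical names, and the approach shown (treating a leader as unknown when its broker is missing from the known broker set) is only one assumed way to keep the two views consistent so that callers never dereference a missing broker.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical illustration only; not the Kafka client's MetadataCache.
final class LeaderConsistencySketch {

    // Maps partition id -> leader broker id, but only when that broker is
    // present in the known broker set. Otherwise the leader is reported as
    // unknown (Optional.empty()) instead of pointing at a missing broker.
    static Map<Integer, Optional<Integer>> reconcileLeaders(
            Map<Integer, Integer> reportedLeaders, Set<Integer> knownBrokerIds) {
        Map<Integer, Optional<Integer>> view = new HashMap<>();
        for (Map.Entry<Integer, Integer> entry : reportedLeaders.entrySet()) {
            int leaderId = entry.getValue();
            Optional<Integer> leader = knownBrokerIds.contains(leaderId)
                    ? Optional.of(leaderId)
                    : Optional.empty();
            view.put(entry.getKey(), leader);
        }
        return view;
    }

    public static void main(String[] args) {
        // Partition 1 names broker 2 as leader, but broker 2 is not in the
        // broker set (for example, because metadata events were reordered).
        Map<Integer, Integer> reportedLeaders = new HashMap<>();
        reportedLeaders.put(0, 1);
        reportedLeaders.put(1, 2);
        Set<Integer> knownBrokerIds = new HashSet<>();
        knownBrokerIds.add(1);

        Map<Integer, Optional<Integer>> view =
                reconcileLeaders(reportedLeaders, knownBrokerIds);
        System.out.println("partition 0 leader known: " + view.get(0).isPresent());
        System.out.println("partition 1 leader known: " + view.get(1).isPresent());
    }
}
```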

3. KAFKA-9212; Ensure LeaderAndIsr state updated in controller context during reassignment

KIP-320 improved fetch semantics by adding leader epoch validation. This relies on
reliable propagation of leader epoch information from the controller. Unfortunately, we
have encountered a bug during partition reassignment in which the leader epoch in the
controller context is not properly updated. As a result, UpdateMetadata requests are
sent with stale epoch information, and the metadata caches on the brokers fall out of
sync.

This bug has existed for a long time, but it only became a problem with the new epoch
validation done by the client. Because the client includes the stale leader epoch in its
requests, the leader rejects them, yet the stale metadata cache on the brokers prevents
the consumer from getting the latest epoch. Hence the consumer cannot make progress
while a reassignment is ongoing.

Although it is straightforward to fix this problem in the controller for new releases
(which this patch does), it is not so easy to fix older brokers, which means new clients
could still encounter brokers with this bug. To address this, this patch also modifies
the client to treat the leader epoch returned in the Metadata response as "unreliable"
if it comes from an older version of the protocol. In that case, the client discards the
returned epoch and does not include it in any requests.

Also note that the correct epoch is still forwarded to replicas in the LeaderAndIsr
request, so this bug does not affect replication.

Reviewers: Jun Rao <junrao@gmail.com>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Ismael Juma <ismael@juma.me.uk>
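
The client-side workaround in item 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the patch itself: `LeaderEpochSketch`, `usableLeaderEpoch`, and `leaderAccepts` are hypothetical names, the version cutoff is passed in as a parameter rather than taken from the Kafka protocol, and the leader-side check is simplified to "reject any epoch that does not match the current one".

```java
import java.util.Optional;

// Hypothetical illustration only; not the Kafka client implementation.
final class LeaderEpochSketch {

    // Returns the epoch the client may attach to its requests. Epochs from
    // Metadata responses older than minReliableVersion (an assumed cutoff
    // supplied by the caller) are treated as unreliable and discarded.
    static Optional<Integer> usableLeaderEpoch(
            short responseVersion, int reportedEpoch, short minReliableVersion) {
        if (responseVersion < minReliableVersion) {
            return Optional.empty();
        }
        return Optional.of(reportedEpoch);
    }

    // Simplified leader-side validation: a request that carries no epoch
    // skips validation; one that carries an epoch must match the current one.
    static boolean leaderAccepts(Optional<Integer> requestEpoch, int currentEpoch) {
        return !requestEpoch.isPresent() || requestEpoch.get() == currentEpoch;
    }

    public static void main(String[] args) {
        short minReliableVersion = 9; // assumed value, for illustration only
        int currentEpoch = 5;         // epoch on the actual leader
        int staleEpoch = 4;           // epoch served from a stale broker cache

        // Stale epoch taken at face value: the leader rejects the request.
        Optional<Integer> trusted =
                usableLeaderEpoch((short) 9, staleEpoch, minReliableVersion);
        System.out.println("trusted stale epoch accepted: "
                + leaderAccepts(trusted, currentEpoch));

        // Same stale epoch from an older protocol version: it is discarded,
        // so the request carries no epoch and is not rejected on that basis.
        Optional<Integer> discarded =
                usableLeaderEpoch((short) 8, staleEpoch, minReliableVersion);
        System.out.println("discarded epoch accepted: "
                + leaderAccepts(discarded, currentEpoch));
    }
}
```

The trade-off in this sketch is that a request without an epoch gives up the extra validation from KIP-320 against such brokers, in exchange for the consumer being able to make progress during a reassignment.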

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

efeg and others added 3 commits December 11, 2019 16:43
TICKET =
LI_DESCRIPTION =

EXIT_CRITERIA = MANUAL [""]
…data (apache#7772)

TICKET = 
LI_DESCRIPTION = 

EXIT_CRITERIA = MANUAL [""]
…ted in controller context during reassignment

TICKET = KAFKA-9212
LI_DESCRIPTION =

EXIT_CRITERIA = HASH [baf7766]
ORIGINAL_DESCRIPTION =

@efeg efeg requested a review from xiowu0 December 12, 2019 01:37
@xiowu0 xiowu0 left a comment

LGTM. Please change "[LI-HOTFIX] KAFKA-9261;" to "[LI-CHERRY-PICK]" as discussed.

@xiowu0 xiowu0 merged commit 49d8318 into linkedin:2.3-li Dec 12, 2019
@efeg efeg changed the title from "Rollback hotfix to pick up fix for KAFKA-9212 and KAFKA-9261." to "[LI-HOTFIX] Rollback hotfix to pick up fix for KAFKA-9212 and KAFKA-9261." Dec 28, 2019
@efeg efeg deleted the fix/cherryPickPR7805 branch January 6, 2020 19:29