Skip to content

KAFKA-9701 (fix): Only check protocol name when generation is valid#8324

Merged
guozhangwang merged 2 commits intoapache:trunkfrom
abbccdda:KAFKA-9701-fix
Mar 21, 2020
Merged

KAFKA-9701 (fix): Only check protocol name when generation is valid#8324
guozhangwang merged 2 commits intoapache:trunkfrom
abbccdda:KAFKA-9701-fix

Conversation

@abbccdda
Copy link
Copy Markdown

This bug was incurred by #7994 with a too-strong consistency check. It is because a reset generation operation could be called in between the joinGroupRequest -> joinGroupResponse -> SyncGroupRequest -> SyncGroupResponse sequence of events, if user calls unsubscribe in the middle of consumer#poll().

Proper fix is to avoid the protocol name check when the generation is invalid.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@apurvam
Copy link
Copy Markdown
Contributor

apurvam commented Mar 21, 2020

Thanks for the patch.. How was this found? Should we add tests?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just side cleanups starting L700

@mjsax
Copy link
Copy Markdown
Member

mjsax commented Mar 21, 2020

@abbccdda Would it be valid to call unsubscribe() during poll()? (I assume you mean a call in one of the RebalanceListener callback methods?)

If we consider it not valid, we might rather raise an IllegalStateException of somebody make such a call. What would be the use case to call unsubscribe() in the callback?

@abbccdda
Copy link
Copy Markdown
Author

@apurvam It was reproduced from the soak test. Will discuss with the team whether we need a stream side test as the case to mock here is a little bit tricky.

@guozhangwang
Copy link
Copy Markdown
Contributor

@mjsax I think in streams case we are not unsubscribe during poll, as a rebalance may be completed in several consecutive poll calls, and we call unsubscribe in between, which to the consumer's pov is totally legitimate.

@abbccdda I think this is not a streams-only issue, as mentioned above, any consumer could call unsubscribe after a poll call while the rebalance is actually not completed yet. It is not end of world that we did not capture it previously until we hit in soak, since unsubscribe is not a common callee indeed, and with that regard I think covering this on the consumer side is fine.

@guozhangwang
Copy link
Copy Markdown
Contributor

LGTM! Merging to trunk and 2.5

@guozhangwang guozhangwang merged commit c75dc5e into apache:trunk Mar 21, 2020
guozhangwang pushed a commit that referenced this pull request Mar 21, 2020
…8324)

This bug was incurred by #7994 with a too-strong consistency check. It is because a reset generation operation could be called in between the joinGroupRequest -> joinGroupResponse -> SyncGroupRequest -> SyncGroupResponse sequence of events, if user calls unsubscribe in the middle of consumer#poll().

Proper fix is to avoid the protocol name check when the generation is invalid.

Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
@mjsax mjsax added the consumer label Mar 21, 2020
@dajac
Copy link
Copy Markdown
Member

dajac commented Mar 21, 2020

Nice find!

Comment on lines +916 to +917
return protocolName != null && generation() != Generation.NO_GENERATION
&& !protocolName.equals(generation().protocolName);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guozhangwang @abbccdda Shouldn't we synchronise this or use a local reference of the generation to be 100% safe?

@dajac
Copy link
Copy Markdown
Member

dajac commented Mar 23, 2020

@guozhangwang @abbccdda While I agree with the fix to unblock the release, I wonder if it wouldn't be better to explicitly handle this case in the SyncGroupResponseHandler and fail the rebalance with when it happens. I think that an explicit handling would avoid such subtile mistakes in the future. We already do something like this in the JoinGroupResponseHandler. What do you think?

@alfhv
Copy link
Copy Markdown

alfhv commented May 11, 2022

I'm running on version 2.6.1 and I'm having exactly same issue described here.
During a DR test some kafka nodes and consumers went down. When consumers were restarted they were rejected by broker with InconsistencyGroupProtocol exception.
Stopping and restarting all consumers solved the issue.
It is possible that scenario described on this thread could arrive on version 2.6.1 (maybe due to other reason)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants