KAFKA-8225 & KIP-345 part-2: fencing static member instances with conflicting group.instance.id#6650
Conversation
fd37311 to
322225d
Compare
|
Retest this please |
caf3ff5 to
a8bf61e
Compare
|
@hachikuji @guozhangwang Could you do a review when you got time? Thanks! |
|
Retest this please. |
guozhangwang
left a comment
There was a problem hiding this comment.
Made a pass over the non-testing code.
There was a problem hiding this comment.
Should this happen under normal case? If not we should log it as ERROR and return false.
There was a problem hiding this comment.
We are being extra cautious here since we don't want to unexpectedly fence any member without this structure. Returning true means we don't check this case in static member.id validation, but it is not guaranteed to be valid.
There was a problem hiding this comment.
If we think that client code should not validly set their member.ids, i.e. they should always be set by brokers and broker code always set it in a way of *-*, then this should not happen and I think it is okay to treat as a fatal error and reject the client request.
cbe5e4e to
15bdaa7
Compare
There was a problem hiding this comment.
I'm actually thinking we should just reject the request all together if its member-id is ill-formatted (of course, not in this function then, but rather do that at the very beginning of every request handling). But if you feel it is good enough with sound rationales, I'm fine to leave it as is.
There was a problem hiding this comment.
@guozhangwang yea the reasoning is like we need to do a lot of unit test refactoring in this case. I could do that in a separate PR.
There was a problem hiding this comment.
Sounds good. I agree it may be out of the scope for this PR.
|
The non-testing code looks promising to me now except a few minor questions above. Will move on to the testing code itself. Maybe @hachikuji can take another look. |
|
@guozhangwang Thank you for the review! |
guozhangwang
left a comment
There was a problem hiding this comment.
Made a pass on non-testing code as well. It lgtm modulo a few minor comments.
cc @hachikuji
There was a problem hiding this comment.
Hmm.. Why are we throwing this instead of invoking the callback?
There was a problem hiding this comment.
The goal here is to fail immediately, because we have detected a fenced exception in commit response queue, there should be no point retrying.
There was a problem hiding this comment.
SGTM. It is symmetric to our ProducerFencedException, where for past event's triggered fencing, it would be thrown to callers directly; only if this call triggers the first fenced error exactly, it would be marked in the callback.
There was a problem hiding this comment.
Hmm one sec, a not-very-common-but-possible-pattern may be:
- user first call
commitAsync, the response sent back has Fenced error; it would be kept. - user second call
commitAsync, we would poll the completion object and throw immediately. - however, let's say user swallowed it and call
commitAsyncagain, in this case we would proceed since the previous completion has been polled and hence there's no fence error any more, right?
I think the contract should be, once the consumer falls into the fenced state, it should always be in that state and hence always throw immediately for any following function calls. In this sense, we should probably change this logic a bit.
There was a problem hiding this comment.
I think the contract should be, once the consumer falls into the fenced state, it should always be in that state and hence always throw immediately for any following function calls. In this sense, we should probably change this logic a bit.
Yes, that is what I was trying to get to above. Adding a FENCED state to MemberState would be a nice way to achieve this.
There was a problem hiding this comment.
This might be clearer if we separate the static and dynamic cases explicitly:
groupInstanceId match {
case Some(instanceId) =>
// Static member
case None =>
// Dynamic member
}There was a problem hiding this comment.
I think I still slightly prefer changing the protocols of SyncGroup, OffsetCommit, and Heartbeat so that the instance Id is an explicitly provided field. This implementation is not too terrible in terms of complexity, but it feels a tad hacky/dangerous to mix internal state with user-provided strings. Do you think there are any major downsides to modifying the protocols?
There was a problem hiding this comment.
I think our previous discussion covers following trade-offs (by choosing timestamp solution):
- Avoiding bumping up many group protocols all at once
- Making debugging easier
Modifying many protocols all at once seems like over-kill, when we have good checking methodology already. Honestly I'm not fully convinced by the negative impact of bumping protocols, maybe @guozhangwang could chime in here and explain a bit on the pros and cons.
Here I still second the current approach because of 2): right now we don't have good mechanism to debug static membership issue, and meaningful member.id with tracking time seems very helpful compared with random generated id. WDYT?
There was a problem hiding this comment.
I think using the generation scheme we have here for member.id is a good idea in any case. Whether we use it for fencing detection is a separate thing. My argument is basically that we've identified a shortcoming of the protocol and we should try to fix it through protocol instead of by a side channel. I personally think bumping the protocol is not too big of a deal. (I realize the side channel was my idea, but I agreed with general feedback that it was a bit of a hack.)
c22ae3b to
4341085
Compare
hachikuji
left a comment
There was a problem hiding this comment.
Thanks, left a few comments.
| || error == Errors.GROUP_AUTHORIZATION_FAILED | ||
| || error == Errors.GROUP_MAX_SIZE_REACHED | ||
| || error == Errors.FENCED_INSTANCE_ID) { | ||
| || error == Errors.GROUP_MAX_SIZE_REACHED) { |
There was a problem hiding this comment.
If we are fenced, should we keep track of that somewhere so that we do not keep sending RPCs to the coordinator?
There was a problem hiding this comment.
As long as we don't reset our generation info, all subsequent requests should be failing once other consumer joins the group right? Eventually this will lead to a complete crash IIUC.
There was a problem hiding this comment.
For JoinResponse specifically, it should be caught in line 427 above and then falls into else if (!future.isRetriable()) to throw the exception to the callers immediately. So I agree with @abbccdda that no extra logic would be needed.
There was a problem hiding this comment.
I think I wasn't clear. What I'm asking is whether the consumer should remember the fact that it was fenced. So if the user continues trying to do stuff, we fail immediately instead of sending additional requests to the broker.
There was a problem hiding this comment.
For Join/Sync/OffsetCommitSync the failures should be immediate; for heartbeat/commitAsync it would not be immediate but will be quickly surfaced. If we do want a global variable indicating the failure, potentially we need to add a new MemberState
There was a problem hiding this comment.
Yes, a new MemberState would be a nice way to handle this. We can do this later if you do not think it is important now.
|
|
||
| def getGroupInstanceId(rawInstanceId: String): Option[String] = { | ||
| if (rawInstanceId == null || | ||
| config.interBrokerProtocolVersion < KAFKA_2_3_IV0) |
There was a problem hiding this comment.
I wonder if it's sufficient to do this for JoinGroup only. Basically we just try to gate entrance into the group.
By the way, it's worth add a comment in the JoinGroup handler explaining the purpose of the IBP check. Took me a little while to recall that it was tied to the schema we use in the offsets topic.
There was a problem hiding this comment.
This pre-check should keep the behavior in GroupCoordinator consistent IMO.
There was a problem hiding this comment.
The problem is that a static member which already received a JoinGroup response could be downgraded to a dynamic member. Have you thought through what the implications of this are?
There was a problem hiding this comment.
We have IBP to guard against turning a member into static member. So as long as IBP < 2.3, we will not see a downgrade correct?
3ad191d to
bc1d096
Compare
bc1d096 to
046d2a0
Compare
guozhangwang
left a comment
There was a problem hiding this comment.
Made another pass, I only have a question about commit fenced error handling. The key is that once a consumer is fenced, it should always be in that state such that all following calls are thrown immediately. Others are minor.
| || error == Errors.GROUP_AUTHORIZATION_FAILED | ||
| || error == Errors.GROUP_MAX_SIZE_REACHED | ||
| || error == Errors.FENCED_INSTANCE_ID) { | ||
| || error == Errors.GROUP_MAX_SIZE_REACHED) { |
There was a problem hiding this comment.
For JoinResponse specifically, it should be caught in line 427 above and then falls into else if (!future.isRetriable()) to throw the exception to the callers immediately. So I agree with @abbccdda that no extra logic would be needed.
There was a problem hiding this comment.
SGTM. It is symmetric to our ProducerFencedException, where for past event's triggered fencing, it would be thrown to callers directly; only if this call triggers the first fenced error exactly, it would be marked in the callback.
There was a problem hiding this comment.
Hmm one sec, a not-very-common-but-possible-pattern may be:
- user first call
commitAsync, the response sent back has Fenced error; it would be kept. - user second call
commitAsync, we would poll the completion object and throw immediately. - however, let's say user swallowed it and call
commitAsyncagain, in this case we would proceed since the previous completion has been polled and hence there's no fence error any more, right?
I think the contract should be, once the consumer falls into the fenced state, it should always be in that state and hence always throw immediately for any following function calls. In this sense, we should probably change this logic a bit.
7f53609 to
b8ef1cd
Compare
|
@guozhangwang @hachikuji Addressed all comments and replicate fencing logic to all group related protocols. Filed a separate JIRA to track the group error change: |
hachikuji
left a comment
There was a problem hiding this comment.
A few more small comments.
| heartbeat.receiveHeartbeat(); | ||
| } else if (e instanceof FencedInstanceIdException) { | ||
| log.error("Caught fenced group.instance.id {} error in heartbeat thread", groupInstanceId); | ||
| heartbeatThread.failed.set(e); |
There was a problem hiding this comment.
Should we return after we fail the heartbeat thread? We do no want it to keep running I assume.
There was a problem hiding this comment.
We are in a if-else branch here, but I agree. In case someone adds logic after if-else block in the future.
There was a problem hiding this comment.
Oh, actually it's against code style, so just leave it.
There was a problem hiding this comment.
We do need a way to stop the heartbeat thread still, right? Perhaps we can invoke disable()?
There was a problem hiding this comment.
Oh, got it. Let's stop it through disable() then
| log.debug("Attempt to join group failed due to obsolete coordinator information: {}", error.message()); | ||
| future.raise(error); | ||
| } else if (error == Errors.FENCED_INSTANCE_ID) { | ||
| log.error("Received fatal exception: group.instance.id {} gets fenced", groupInstanceId); |
There was a problem hiding this comment.
We can leave the instance id out of this message since we added it to the log context. There are a few more of these below.
| || error == Errors.GROUP_AUTHORIZATION_FAILED | ||
| || error == Errors.GROUP_MAX_SIZE_REACHED | ||
| || error == Errors.FENCED_INSTANCE_ID) { | ||
| || error == Errors.GROUP_MAX_SIZE_REACHED) { |
There was a problem hiding this comment.
Yes, a new MemberState would be a nice way to handle this. We can do this later if you do not think it is important now.
There was a problem hiding this comment.
I think the contract should be, once the consumer falls into the fenced state, it should always be in that state and hence always throw immediately for any following function calls. In this sense, we should probably change this logic a bit.
Yes, that is what I was trying to get to above. Adding a FENCED state to MemberState would be a nice way to achieve this.
|
@hachikuji Addressed new comments, and another JIRA to track |
hachikuji
left a comment
There was a problem hiding this comment.
LGTM. Thanks for the patch!
792edbc to
82f1f32
Compare
|
LGTM. Waiting for green builds |
|
Merged to trunk, kudos @abbccdda !! |
…flicting group.instance.id (apache#6650) For static members join/rejoin, we encode the current timestamp in the new member.id. The format looks like group.instance.id-timestamp. During consumer/broker interaction logic (Join, Sync, Heartbeat, Commit), we shall check the whether group.instance.id is known on group. If yes, we shall match the member.id stored on static membership map with the request member.id. If mismatching, this indicates a conflict consumer has used same group.instance.id, and it will receive a fatal exception to shut down. Right now the only missing part is the system test. Will work on it offline while getting the major logic changes reviewed. Reviewers: Ryanne Dolan <ryannedolan@gmail.com>, Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
For static members join/rejoin, we encode the current timestamp in the new
member.id. The format looks likegroup.instance.id-timestamp.During consumer/broker interaction logic (Join, Sync, Heartbeat, Commit), we shall check the whether
group.instance.idis known on group. If yes, we shall match themember.idstored on static membership map with the requestmember.id. If mismatching, this indicates a conflict consumer has used same group.instance.id, and it will receive a fatal exception to shut down.Right now the only missing part is the system test. Will work on it offline while getting the major logic changes reviewed.
Committer Checklist (excluded from commit message)