KAFKA-8225 & KIP-345 part-2: fencing static member instances with conflicting group.instance.id by abbccdda · Pull Request #6650 · apache/kafka

abbccdda · 2019-04-29T23:48:33Z

For static member join/rejoin, we encode the current timestamp in the new member.id. The format is group.instance.id-timestamp.

During the consumer/broker interaction logic (Join, Sync, Heartbeat, Commit), we check whether the group.instance.id is known to the group. If it is, we match the member.id stored in the static membership map against the request's member.id. A mismatch indicates that a conflicting consumer has used the same group.instance.id, and the requesting consumer receives a fatal exception and shuts down.
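The check described above can be sketched roughly as follows. This is a minimal illustration, not the broker code from this PR: `StaticMemberRegistry`, `join`, and `isFenced` are hypothetical names, and the only behavior assumed is what the description states (a member.id of the form group.instance.id-timestamp, and a comparison against the static membership map).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; class and method names are illustrative, not Kafka's.
final class StaticMemberRegistry {
    // group.instance.id -> member.id most recently issued for that instance
    private final Map<String, String> staticMembers = new HashMap<>();

    // Issue a fresh member.id of the form <group.instance.id>-<timestamp>
    // and record it as the current owner of the instance id.
    String join(String groupInstanceId, long nowMs) {
        String memberId = groupInstanceId + "-" + nowMs;
        staticMembers.put(groupInstanceId, memberId);
        return memberId;
    }

    // True when the instance id is known to the group but the request
    // carries a member.id other than the one stored for it.
    boolean isFenced(String groupInstanceId, String requestMemberId) {
        String current = staticMembers.get(groupInstanceId);
        return current != null && !current.equals(requestMemberId);
    }
}
```

In this sketch, a second `join` with the same instance id replaces the stored member.id, so a later Sync, Heartbeat, or Commit that still carries the older member.id would be reported as fenced.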

Right now the only missing part is the system test. Will work on it offline while getting the major logic changes reviewed.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

abbccdda · 2019-04-30T06:53:43Z

Retest this please

abbccdda · 2019-05-01T20:08:16Z

@hachikuji @guozhangwang Could you do a review when you get time? Thanks!

abbccdda · 2019-05-01T22:02:58Z

Retest this please.

guozhangwang

Made a pass over the non-testing code.

guozhangwang · 2019-05-08T01:23:52Z

Should this happen under normal case? If not we should log it as ERROR and return false.

We are being extra cautious here since we don't want to unexpectedly fence any member without this structure. Returning true means we don't check this case in static member.id validation, but it is not guaranteed to be valid.

If we think that client code should never validly set its own member.id, i.e. it should always be set by the broker, and broker code always sets it in the form *-*, then this should not happen, and I think it is okay to treat it as a fatal error and reject the client request.

guozhangwang · 2019-05-09T23:24:26Z

I'm actually thinking we should just reject the request altogether if its member-id is ill-formatted (of course, not in this function then, but rather at the very beginning of every request handling). But if you feel it is good enough and have a sound rationale, I'm fine to leave it as is.

@guozhangwang yea the reasoning is like we need to do a lot of unit test refactoring in this case. I could do that in a separate PR.

Sounds good. I agree it may be out of the scope for this PR.
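For illustration only, a format check of the kind discussed in this thread might look like the sketch below. `MemberIdFormat` and `hasTimestampSuffix` are hypothetical names; the sketch assumes nothing beyond the group.instance.id-timestamp format from the PR description and does not reflect where or how the broker actually validates member ids.

```java
// Hypothetical sketch; not the validation code from this PR.
final class MemberIdFormat {
    // True when memberId looks like <prefix>-<digits>, i.e. a non-empty prefix,
    // a dash, and a non-empty all-digit suffix (the encoded timestamp).
    static boolean hasTimestampSuffix(String memberId) {
        if (memberId == null) return false;
        int dash = memberId.lastIndexOf('-');
        // Reject a missing dash, an empty prefix, or an empty suffix.
        if (dash <= 0 || dash == memberId.length() - 1) return false;
        for (int i = dash + 1; i < memberId.length(); i++) {
            char c = memberId.charAt(i);
            if (c < '0' || c > '9') return false;
        }
        return true;
    }
}
```

Because the instance id itself may contain dashes, the sketch splits on the last dash rather than the first.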

guozhangwang · 2019-05-09T23:25:26Z

The non-testing code looks promising to me now except a few minor questions above.

Will move on to the testing code itself. Maybe @hachikuji can take another look.

abbccdda · 2019-05-09T23:58:48Z

@guozhangwang Thank you for the review!

guozhangwang

Made a pass on non-testing code as well. It lgtm modulo a few minor comments.

cc @hachikuji

hachikuji

@abbccdda Thanks, left a few comments.

hachikuji · 2019-05-11T23:26:57Z

Hmm.. Why are we throwing this instead of invoking the callback?

The goal here is to fail immediately, because we have detected a fenced exception in commit response queue, there should be no point retrying.

SGTM. It is symmetric to our ProducerFencedException: for fencing triggered by a past event, the exception is thrown to callers directly; only when this call itself triggers the first fenced error is it reported through the callback.

Hmm one sec, a not-very-common-but-possible-pattern may be:

1) The user first calls commitAsync; the response sent back has the Fenced error, which is kept.

2) The user calls commitAsync a second time; we poll the completion object and throw immediately.

3) However, say the user swallows the exception and calls commitAsync again. In this case we would proceed, since the previous completion has already been polled and hence there is no fence error any more, right?

I think the contract should be, once the consumer falls into the fenced state, it should always be in that state and hence always throw immediately for any following function calls. In this sense, we should probably change this logic a bit.

Yes, that is what I was trying to get to above. Adding a FENCED state to MemberState would be a nice way to achieve this.
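The contract proposed here, that a fenced consumer stays fenced and every later call throws without contacting the broker, could be sketched as below. This is an assumption-laden illustration: `FencedAwareClient`, its `MemberState` values, and `FencedException` are hypothetical stand-ins, and the PR itself deferred the FENCED state to a follow-up.

```java
// Hypothetical sketch; names are illustrative, not the Kafka client's classes.
final class FencedAwareClient {
    enum MemberState { UNJOINED, STABLE, FENCED }

    static final class FencedException extends RuntimeException {
        FencedException(String message) { super(message); }
    }

    private MemberState state = MemberState.STABLE;
    private int requestsSent = 0;

    // Called when any response carries the fenced error.
    // Nothing in this class ever moves the state out of FENCED.
    void onFencedResponse() { state = MemberState.FENCED; }

    // Stand-in for commitAsync or any other group-related call.
    void commitAsync() {
        ensureNotFenced();
        requestsSent++; // a real client would send the request here
    }

    int requestsSent() { return requestsSent; }

    MemberState state() { return state; }

    private void ensureNotFenced() {
        if (state == MemberState.FENCED)
            throw new FencedException("consumer was fenced; no further requests are sent");
    }
}
```

Unlike a one-shot error held in a completion queue, the state here is not consumed by the first throw, so swallowing the exception and retrying fails again in the same way.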

hachikuji · 2019-05-11T23:43:28Z

This might be clearer if we separate the static and dynamic cases explicitly:

    groupInstanceId match {
      case Some(instanceId) => // Static member
      case None => // Dynamic member
    }

hachikuji · 2019-05-11T23:56:57Z

I think I still slightly prefer changing the protocols of SyncGroup, OffsetCommit, and Heartbeat so that the instance Id is an explicitly provided field. This implementation is not too terrible in terms of complexity, but it feels a tad hacky/dangerous to mix internal state with user-provided strings. Do you think there are any major downsides to modifying the protocols?

I think our previous discussion covers following trade-offs (by choosing timestamp solution):

1) Avoiding bumping up many group protocols all at once

2) Making debugging easier

Modifying many protocols all at once seems like over-kill, when we have good checking methodology already. Honestly I'm not fully convinced by the negative impact of bumping protocols, maybe @guozhangwang could chime in here and explain a bit on the pros and cons.

Here I still second the current approach because of 2): right now we don't have good mechanism to debug static membership issue, and meaningful member.id with tracking time seems very helpful compared with random generated id. WDYT?

I think using the generation scheme we have here for member.id is a good idea in any case. Whether we use it for fencing detection is a separate thing. My argument is basically that we've identified a shortcoming of the protocol and we should try to fix it through protocol instead of by a side channel. I personally think bumping the protocol is not too big of a deal. (I realize the side channel was my idea, but I agreed with general feedback that it was a bit of a hack.)

hachikuji

Thanks, left a few comments.

hachikuji · 2019-05-17T15:07:20Z

                    || error == Errors.GROUP_AUTHORIZATION_FAILED
-                    || error == Errors.GROUP_MAX_SIZE_REACHED
-                    || error == Errors.FENCED_INSTANCE_ID) {
+                    || error == Errors.GROUP_MAX_SIZE_REACHED) {


If we are fenced, should we keep track of that somewhere so that we do not keep sending RPCs to the coordinator?

As long as we don't reset our generation info, all subsequent requests should fail once another consumer joins the group, right? Eventually this will lead to a complete crash IIUC.

For JoinResponse specifically, it should be caught in line 427 above and then falls into else if (!future.isRetriable()) to throw the exception to the callers immediately. So I agree with @abbccdda that no extra logic would be needed.

I think I wasn't clear. What I'm asking is whether the consumer should remember the fact that it was fenced. So if the user continues trying to do stuff, we fail immediately instead of sending additional requests to the broker.

For Join/Sync/OffsetCommitSync the failures should be immediate; for heartbeat/commitAsync it would not be immediate but will be quickly surfaced. If we do want a global variable indicating the failure, potentially we need to add a new MemberState

Yes, a new MemberState would be a nice way to handle this. We can do this later if you do not think it is important now.

hachikuji · 2019-05-17T15:49:26Z


+  def getGroupInstanceId(rawInstanceId: String): Option[String] = {
+      if (rawInstanceId == null ||
+        config.interBrokerProtocolVersion < KAFKA_2_3_IV0)


I wonder if it's sufficient to do this for JoinGroup only. Basically we just try to gate entrance into the group.

By the way, it's worth adding a comment in the JoinGroup handler explaining the purpose of the IBP check. It took me a little while to recall that it was tied to the schema we use in the offsets topic.

This pre-check should keep the behavior in GroupCoordinator consistent IMO.

The problem is that a static member which already received a JoinGroup response could be downgraded to a dynamic member. Have you thought through what the implications of this are?

We have IBP to guard against turning a member into static member. So as long as IBP < 2.3, we will not see a downgrade correct?
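As a rough Java rendering of the gate under discussion: the Scala snippet above is truncated, so the sketch below assumes the omitted branches return an empty value when the check fails and the raw id otherwise. `InstanceIdGate` and the `Ibp` enum are hypothetical stand-ins for the broker's config and version constants.

```java
import java.util.Optional;

// Hypothetical sketch; the real method lives in Scala broker code and is only partly shown above.
final class InstanceIdGate {
    // Stand-in for inter-broker protocol versions, declared in release order
    // so that compareTo reflects "older than" / "newer than".
    enum Ibp { V2_2, V2_3_IV0, V2_3_IV1 }

    // Treat the member as dynamic (empty) when no instance id was sent or when
    // the configured IBP predates static membership support.
    static Optional<String> groupInstanceId(String rawInstanceId, Ibp configured) {
        if (rawInstanceId == null || configured.compareTo(Ibp.V2_3_IV0) < 0)
            return Optional.empty();
        return Optional.of(rawInstanceId);
    }
}
```

Applying the same gate on every group request, rather than on JoinGroup alone, is the design question raised in this thread.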

guozhangwang

Made another pass, I only have a question about commit fenced error handling. The key is that once a consumer is fenced, it should always be in that state such that all following calls are thrown immediately. Others are minor.

abbccdda · 2019-05-17T23:54:51Z

@guozhangwang @hachikuji Addressed all comments and replicated the fencing logic to all group-related protocols. Filed a separate JIRA to track the group error change:
https://issues.apache.org/jira/browse/KAFKA-8386

hachikuji

A few more small comments.

hachikuji · 2019-05-18T00:00:37Z

                                            heartbeat.receiveHeartbeat();
+                                        } else if (e instanceof FencedInstanceIdException) {
+                                            log.error("Caught fenced group.instance.id {} error in heartbeat thread", groupInstanceId);
+                                            heartbeatThread.failed.set(e);


Should we return after we fail the heartbeat thread? We do not want it to keep running, I assume.

We are in an if-else branch here, but I agree, in case someone adds logic after the if-else block in the future.

Oh, actually it's against code style, so just leave it.

We do need a way to stop the heartbeat thread still, right? Perhaps we can invoke disable()?

Oh, got it. Let's stop it through disable() then
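The outcome of this thread, recording the failure and then stopping the heartbeat loop through disable(), might be sketched like this. `HeartbeatLoop` and its nested exception are hypothetical stand-ins written for this example; only the `failed.set(e)` plus `disable()` pairing comes from the discussion.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; not the client's actual heartbeat thread.
final class HeartbeatLoop {
    // Local stand-in for the client's fenced exception type.
    static final class FencedInstanceIdException extends RuntimeException {
        FencedInstanceIdException(String message) { super(message); }
    }

    private final AtomicReference<RuntimeException> failed = new AtomicReference<>();
    private boolean enabled = true;
    private int heartbeatsSent = 0;

    synchronized void disable() { enabled = false; }

    synchronized boolean isEnabled() { return enabled; }

    RuntimeException failure() { return failed.get(); }

    int heartbeatsSent() { return heartbeatsSent; }

    // One loop iteration; 'error' is the simulated heartbeat result (null means success).
    void runOnce(RuntimeException error) {
        if (!isEnabled()) return; // a disabled loop sends nothing
        heartbeatsSent++;
        if (error instanceof FencedInstanceIdException) {
            failed.set(error); // surfaced to the caller on its next poll
            disable();         // stop heartbeating once fenced
        }
    }
}
```

Disabling inside the fenced branch avoids an early return in the middle of the if-else chain while still guaranteeing that later iterations do nothing.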

hachikuji · 2019-05-18T00:01:11Z

                log.debug("Attempt to join group failed due to obsolete coordinator information: {}", error.message());
                future.raise(error);
+            } else if (error == Errors.FENCED_INSTANCE_ID) {
+                log.error("Received fatal exception: group.instance.id {} gets fenced", groupInstanceId);


We can leave the instance id out of this message since we added it to the log context. There are a few more of these below.

Sounds good!

abbccdda · 2019-05-18T00:27:31Z

@hachikuji Addressed new comments, and another JIRA to track Fenced state :)
https://issues.apache.org/jira/browse/KAFKA-8387

hachikuji

LGTM. Thanks for the patch!

guozhangwang · 2019-05-18T01:56:26Z

LGTM. Waiting for green builds

guozhangwang · 2019-05-18T14:28:51Z

Merged to trunk, kudos @abbccdda !!

…flicting group.instance.id (apache#6650)

For static member join/rejoin, we encode the current timestamp in the new member.id. The format is group.instance.id-timestamp.

During the consumer/broker interaction logic (Join, Sync, Heartbeat, Commit), we check whether the group.instance.id is known to the group. If it is, we match the member.id stored in the static membership map against the request's member.id. A mismatch indicates that a conflicting consumer has used the same group.instance.id, and the requesting consumer receives a fatal exception and shuts down.

Right now the only missing part is the system test. Will work on it offline while getting the major logic changes reviewed.

Reviewers: Ryanne Dolan <ryannedolan@gmail.com>, Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>

abbccdda changed the title ~~KAFKA-8225: fencing static member instances with conflicting group.instance.id~~ KAFKA-8225: fencing static member instances with conflicting group.instance.id [WIP] Apr 29, 2019

abbccdda force-pushed the fencing_instance branch 2 times, most recently from fd37311 to 322225d Compare April 30, 2019 01:51

abbccdda force-pushed the fencing_instance branch 5 times, most recently from caf3ff5 to a8bf61e Compare May 1, 2019 19:29

abbccdda changed the title ~~KAFKA-8225: fencing static member instances with conflicting group.instance.id [WIP]~~ KAFKA-8225: fencing static member instances with conflicting group.instance.id May 2, 2019

guozhangwang reviewed May 8, 2019

abbccdda force-pushed the fencing_instance branch 4 times, most recently from cbe5e4e to 15bdaa7 Compare May 9, 2019 17:37

guozhangwang reviewed May 9, 2019

Comment thread clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java Outdated

guozhangwang reviewed May 9, 2019

abbccdda changed the title ~~KAFKA-8225: fencing static member instances with conflicting group.instance.id~~ KAFKA-8225 & KIP-345 part-2: fencing static member instances with conflicting group.instance.id May 10, 2019

guozhangwang reviewed May 11, 2019

Comment thread clients/src/test/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinatorTest.java Outdated

Comment thread clients/src/test/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinatorTest.java Outdated

hachikuji reviewed May 11, 2019

ryannedolan reviewed May 14, 2019

Comment thread core/src/main/scala/kafka/coordinator/group/GroupMetadata.scala Outdated

Comment thread core/src/test/scala/unit/kafka/coordinator/group/GroupCoordinatorTest.scala Outdated

Comment thread core/src/test/scala/unit/kafka/coordinator/group/GroupCoordinatorTest.scala Outdated

abbccdda added 5 commits May 16, 2019 13:33

fencing instance

564b908

fix test case and comment

341a84d

fencing system test

a42a00d

address comments

6927cc5

timestamp validation

08f7949

abbccdda force-pushed the fencing_instance branch from c22ae3b to 4341085 Compare May 17, 2019 04:36

hachikuji reviewed May 17, 2019

abbccdda force-pushed the fencing_instance branch 2 times, most recently from 3ad191d to bc1d096 Compare May 17, 2019 17:27

address more comments

046d2a0

abbccdda force-pushed the fencing_instance branch from bc1d096 to 046d2a0 Compare May 17, 2019 18:16

address comments

a488633

guozhangwang reviewed May 17, 2019

fenced commit

b8ef1cd

abbccdda force-pushed the fencing_instance branch from 7f53609 to b8ef1cd Compare May 17, 2019 20:45

abbccdda added 2 commits May 17, 2019 14:18

add logic to reject invalid requests

2bd5524

unit tests for version exception

e32cbdb

guozhangwang reviewed May 17, 2019

Comment thread core/src/main/scala/kafka/server/KafkaApis.scala Outdated

guozhangwang reviewed May 17, 2019

Comment thread clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java Outdated

address tests and replicate version fencing logic

ab307c9

fix commit fence

a93c23c

hachikuji reviewed May 18, 2019

address new comments

9ae1431

hachikuji approved these changes May 18, 2019

minor: disable hb thread when fenced

82f1f32

abbccdda force-pushed the fencing_instance branch from 792edbc to 82f1f32 Compare May 18, 2019 01:08

fix unit test

4a0c245

guozhangwang merged commit 9fa331b into apache:trunk May 18, 2019

Nevon mentioned this pull request Sep 22, 2020

Group Instance ID support tulios/kafkajs#884

Open
