KAFKA-12352: Make sure all rejoin group and reset state has a reason by guozhangwang · Pull Request #10232 · apache/kafka

guozhangwang · 2021-03-01T06:46:59Z

Create a reason string to be used for the INFO log entry whenever we request a re-join or reset the generation state.
Some minor cleanups.
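As a rough illustration of the idea (not the actual AbstractCoordinator code), the sketch below threads a reason string through every rejoin or reset request and logs it at INFO. The class name, the accessors, and the exact log wording are hypothetical; only the method names requestRejoin and resetStateAndRejoin come from this PR's discussion.

```java
import java.util.logging.Logger;

// Hypothetical stand-in for the coordinator; it is not the Kafka class.
final class RejoinReasonSketch {
    private static final Logger LOG = Logger.getLogger(RejoinReasonSketch.class.getName());

    private boolean rejoinNeeded = false;
    private String lastReason = null;

    // Every rejoin request must say why, and the reason is logged at INFO.
    synchronized void requestRejoin(final String reason) {
        LOG.info("Request joining group due to: " + reason); // wording is an assumption
        this.rejoinNeeded = true;
        this.lastReason = reason;
    }

    // Resetting state also carries the reason, then falls through to a rejoin request.
    synchronized void resetStateAndRejoin(final String reason) {
        LOG.info("Resetting generation due to: " + reason); // wording is an assumption
        requestRejoin(reason);
    }

    synchronized boolean rejoinNeeded() {
        return rejoinNeeded;
    }

    synchronized String lastReason() {
        return lastReason;
    }
}
```

Because the reason is a required parameter, a caller cannot trigger a rejoin without leaving a trace of why in the INFO log.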

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

guozhangwang

guozhangwang · 2021-03-01T06:47:38Z

                    }
                }
            } else {
-                requestRejoin();


We can remove this since it is a bit redundant now: we call it in each case where it is necessary.

Just to clarify, you mean we don't need to rejoin here since we will always raise an error, and always rejoin (if necessary) when checking that error?

Or are you referring to the requestRejoinOnResponseError calls you added to the two last cases in the below if/else?

I meant the latter: we call that inside the conditions already -- for those fatal errors, we do not need to call this anyways since the consumer will throw and crash.

@guozhangwang I think something may have been messed up during a merge/rebase: I no longer see requestRejoinOnResponseError being invoked anywhere

I added that function for the sync-group handler, to handle the retriable COORDINATOR_NOT_AVAILABLE / NOT_COORDINATOR errors and any unexpected error. After the refactoring PR, these now all fall into the joinGroupIfNeeded path in

    final RuntimeException exception = future.exception();

    resetJoinGroupFuture();

    if (exception instanceof UnknownMemberIdException ||
        exception instanceof IllegalGenerationException ||
        exception instanceof RebalanceInProgressException ||
        exception instanceof MemberIdRequiredException)
        continue;
    else if (!future.isRetriable())
        throw exception;

    resetStateAndRejoin(String.format("rebalance failed with retriable error %s", exception));
    timer.sleep(rebalanceConfig.retryBackoffMs);

This is part of the principle I mentioned:

We may reset the generation and request a rejoin in two different places: 1) in the join/sync-group handler, and 2) in joinGroupIfNeeded, when the future is received. The principle is that these two should not overlap, and 2) is used as a fallback for those common errors from join/sync that we do not handle specifically.

But I forgot to remove this function as part of the second pass; will remove.
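To make the non-overlap principle above concrete, here is a simplified model of the retry loop. It is a sketch under assumptions: the Outcome enum, the scripted queue of attempts, and the class name are invented for illustration, and the real loop also backs off and resets the join future.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical model of the retry loop; names other than the reason format are assumptions.
final class RejoinFallbackSketch {
    enum Outcome { SUCCESS, HANDLED_IN_HANDLER, RETRIABLE, FATAL }

    // Reasons recorded by the fallback path only.
    final List<String> resetReasons = new ArrayList<>();

    // Consumes scripted attempt outcomes and returns how many attempts were needed.
    int joinGroupIfNeeded(final Queue<Outcome> attempts) {
        int tries = 0;
        while (!attempts.isEmpty()) {
            final Outcome outcome = attempts.poll();
            tries++;
            switch (outcome) {
                case SUCCESS:
                    return tries;
                case HANDLED_IN_HANDLER:
                    // Place 1: the join/sync handler already reset state, so just retry.
                    continue;
                case FATAL:
                    // Non-retriable errors propagate; no reset or rejoin is needed.
                    throw new IllegalStateException("fatal error");
                case RETRIABLE:
                    // Place 2: fallback for errors the handlers do not treat specifically.
                    resetReasons.add(
                        String.format("rebalance failed with retriable error %s", outcome));
                    break;
            }
        }
        throw new IllegalStateException("ran out of scripted attempts");
    }
}
```

Each outcome reaches exactly one of the two places, which is what keeps the reset from happening twice for the same error.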

Ok cool, thanks. One last question then: after this refactoring, since we no longer call requestRejoinOnResponseError below, should we re-add the requestRejoin() call here? Or add a requestRejoin to the specific cases in the SyncGroup handler, eg

    } else if (error == Errors.REBALANCE_IN_PROGRESS) {
        log.info("SyncGroup failed: The group began another rebalance. Need to re-join the group. " +
                 "Sent generation was {}", sentGeneration);
        future.raise(error);
    }

I think we do not need to, since it would be called in resetStateAndRejoin(String.format("rebalance failed with retriable error %s", exception)); previously we were calling rejoin twice.

Hmm...but resetStateAndRejoin(String.format("rebalance failed with retriable error %s", exception)); is only called in joinGroupIfNeeded which is only called in ensureActiveGroup, which is in turn only invoked in ConsumerCoordinator#poll.
That said, inside SyncGroupResponseHandler#handle we would already have rejoinNeeded = true and only set it to false if the SyncGroup succeeds. So for that reason I guess we don't need the requestRejoin anywhere inside the SyncGroup handler
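The argument about the flag can be sketched as follows. This is a minimal model, assuming a boolean rejoinNeeded that is already true while a rebalance is in flight; the class and method names are hypothetical, and a null error name stands in for a successful response.

```java
// Hypothetical model of the flag's lifecycle; it is not the Kafka handler.
final class SyncGroupFlagSketch {
    // Already true while the rebalance is in progress.
    private boolean rejoinNeeded = true;

    // errorName == null models a successful SyncGroup response.
    void handleSyncGroupResponse(final String errorName) {
        if (errorName == null) {
            // Only success clears the flag.
            rejoinNeeded = false;
        }
        // On any error the flag is left untouched, so it is still true
        // and no explicit rejoin request is required here.
    }

    boolean rejoinNeeded() {
        return rejoinNeeded;
    }
}
```

Under this model, an explicit rejoin request inside the error branches would be a no-op, which matches the conclusion reached in the thread.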

guozhangwang · 2021-03-01T06:49:18Z

        @Override
        public void onFailure(RuntimeException e, RequestFuture<Void> future) {
-            log.debug("FindCoordinator request failed due to {}", e);
+            log.debug("FindCoordinator request failed due to {}", e.toString());


Minor cleanup, we only need to print the error message but not the stack trace.
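For context, SLF4J-style loggers generally treat a Throwable passed as the final argument as an exception to log together with its stack trace, whereas e.toString() yields only the class name and message. The sketch below shows the size difference using only the JDK; the class and helper names are hypothetical.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper comparing the two renderings of an exception.
final class ExceptionLoggingSketch {
    // One line: fully qualified class name plus message.
    static String summary(final Throwable e) {
        return e.toString();
    }

    // Multi-line: the summary followed by one "at ..." line per stack frame.
    static String fullTrace(final Throwable e) {
        final StringWriter buffer = new StringWriter();
        e.printStackTrace(new PrintWriter(buffer, true));
        return buffer.toString();
    }
}
```

For a debug message that fires on every failed FindCoordinator request, the one-line summary keeps the log readable while still naming the error.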

guozhangwang · 2021-03-01T06:50:07Z

    }

    synchronized void resetGenerationOnResponseError(ApiKeys api, Errors error) {
-        log.debug("Resetting generation after encountering {} from {} response and requesting re-join", error, api);


Note that I intentionally bumped up the log level from debug to info here, since I think this is a message that users should pay attention to in production, where they mostly use INFO. Open to counter-suggestions though.

SGTM. If we find it flooding the logs and not helpful we can reconsider

ableegoldman

Build failed with Execution failed for task ':connect:runtime:compileJava', I guess trunk is broken atm?

ableegoldman · 2021-03-01T21:42:32Z

                    }
                }
            } else {
-                requestRejoin();


Just to clarify, you mean we don't need to rejoin here since we will always raise an error, and always rejoin (if necessary) when checking that error?

Or are you referring to the requestRejoinOnResponseError calls you added to the two last cases in the below if/else?

ableegoldman · 2021-03-01T21:44:38Z

    }

    synchronized void resetGenerationOnResponseError(ApiKeys api, Errors error) {
-        log.debug("Resetting generation after encountering {} from {} response and requesting re-join", error, api);


SGTM. If we find it flooding the logs and not helpful we can reconsider

ableegoldman · 2021-03-01T21:46:06Z

    }

-    private synchronized void resetState() {
+    private synchronized void resetState(final String reason) {


nit: rename to resetStateAndGeneration?

ableegoldman · 2021-03-01T21:49:58Z

                    log.info("SyncGroup failed: {} Marking coordinator unknown. Sent generation was {}",
                             error.message(), sentGeneration);
                    markCoordinatorUnknown(error);
+                    requestRejoinOnResponseError(ApiKeys.SYNC_GROUP, error);


Why do we explicitly rejoin in this case, but not eg REBALANCE_IN_PROGRESS? or UNKNOWN_MEMBER_ID/ILLEGAL_GENERATION ?

You're right, we do not, I've updated this section.

guozhangwang · 2021-03-06T07:50:24Z

We may reset the generation and request a rejoin in two different places: 1) in the join/sync-group handler, and 2) in joinGroupIfNeeded, when the future is received. The principle is that these two should not overlap, and 2) is used as a fallback for those common errors from join/sync that we do not handle specifically.

ableegoldman

LGTM, thanks for the improvement! Feel free to merge if the build passes

ableegoldman · 2021-03-13T01:43:29Z

Failed with unrelated connect.integration.RebalanceSourceConnectorsIntegrationTest.testMultipleWorkersRejoining() and kafka.server.ScramServerStartupTest.testAuthentications()

Conflicts:
* Jenkinsfile: `install` -> `publishToMavenLocal`, drop ARM build and other changes that don't make sense for Confluent's version of `Jenkinsfile`.
* build.gradle: keep Confluent changes for automatic skipping signing for specific version patterns (upstream only does it if the version ends with `SNAPSHOT`).

Commits:
* apache-github/trunk: (59 commits)
  MINOR: Remove redundant allows in import-control.xml (apache#10339)
  MINOR: remove some specifying types in tool command (apache#10329)
  KAFKA-12455: Fix OffsetValidationTest.test_broker_rolling_bounce failure with Raft (apache#10322)
  MINOR: Add toString to various Kafka Metrics classes (apache#10330)
  KAFKA-12330; FetchSessionCache may cause starvation for partitions when FetchResponse is full (apache#10318)
  KAFKA-12427: Don't update connection idle time for muted connections (apache#10267)
  MINOR; Various code cleanups (apache#10319)
  HOTFIX: timeout issue in removeStreamThread() (apache#10321)
  revert stream logging level back to ERROR (apache#10320)
  KAFKA-12352: Make sure all rejoin group and reset state has a reason (apache#10232)
  KAFKA-10348: Share client channel between forwarding and auto creation manager (apache#10135)
  MINOR: Update year in NOTICE (apache#10308)
  KAFKA-12398: Fix flaky test `ConsumerBounceTest.testClose` (apache#10243)
  MINOR: Remove redundant inheritance from FilteringJmxReporter #onMetricRemoved (apache#10303)
  KAFKA-12462: proceed with task revocation in case of thread in PENDING_SHUTDOWN (apache#10311)
  KAFKA-12460; Do not allow raft truncation below high watermark (apache#10310)
  MINOR: Log project, gradle, java and scala versions at the start of the build (apache#10307)
  KAFKA-10357: Add missing repartition topic validation (apache#10305)
  MINOR: Improve error message in MirrorConnectorsIntegrationBaseTest (apache#10268)
  MINOR: Add missing unit tests for Mirror Connect (apache#10192)
  ...

make sure all rejoin groupa and reset state has a reason

b17fc78

guozhangwang commented Mar 1, 2021

ableegoldman reviewed Mar 1, 2021

dengziming mentioned this pull request Mar 3, 2021

MINOR: Fix log format in AbstractCoordinator #10247

Closed

3 tasks

guozhangwang added 3 commits March 4, 2021 16:49

Merge branch 'trunk' of https://github.com/apache/kafka into K12352-r…

b4646ae

…ebalance-trigger-event-logging

Merge branch 'trunk' of https://github.com/apache/kafka into K12352-r…

f2cf3e8

…ebalance-trigger-event-logging

github comments

c36c6a7

guozhangwang added 4 commits March 6, 2021 08:04

incorporate connect changes

ac15c59

Merge branch 'trunk' of https://github.com/apache/kafka into K12352-r…

66a116d

…ebalance-trigger-event-logging

remove unused imports

2bdb62c

remove unused function

c8cb7ab

ableegoldman approved these changes Mar 13, 2021

guozhangwang merged commit 2387d19 into apache:trunk Mar 15, 2021

guozhangwang deleted the K12352-rebalance-trigger-event-logging branch March 15, 2021 16:24

ijuma mentioned this pull request Mar 17, 2021

CONFLUENT: Sync from apache/kafka/trunk (17 March 2021) confluentinc/kafka#536

Merged

3 tasks
