KAFKA-13033: COORDINATOR_NOT_AVAILABLE should be unmapped #10973
Changes from all commits
4fd9408
fe76481
412f12b
befadd7
ea0ce29
f9db3e4
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -111,6 +111,7 @@ public ApiResult<CoordinatorKey, Map<TopicPartition, Errors>> handleResponse( | |
| List<CoordinatorKey> unmapped = new ArrayList<>(); | ||
|
|
||
| Map<TopicPartition, Errors> partitions = new HashMap<>(); | ||
| int totalPartitionCount = 0; | ||
| for (OffsetCommitResponseTopic topic : response.data().topics()) { | ||
| for (OffsetCommitResponsePartition partition : topic.partitions()) { | ||
| TopicPartition tp = new TopicPartition(topic.name(), partition.partitionIndex()); | ||
|
|
@@ -120,9 +121,14 @@ public ApiResult<CoordinatorKey, Map<TopicPartition, Errors>> handleResponse( | |
| } else { | ||
| partitions.put(tp, error); | ||
| } | ||
| totalPartitionCount++; | ||
| } | ||
| } | ||
| if (failed.isEmpty() && unmapped.isEmpty()) | ||
| // only complete this request when: | ||
| // 1. no fail | ||
| // 2. no unmapped | ||
| // 3. all partitions are handled (i.e. no need to retry) | ||
| if (failed.isEmpty() && unmapped.isEmpty() && partitions.size() == totalPartitionCount) | ||
| completed.put(groupId, partitions); | ||
|
|
||
| return new ApiResult<>(completed, failed, unmapped); | ||
|
|
@@ -136,21 +142,28 @@ private void handleError( | |
| ) { | ||
| switch (error) { | ||
| case GROUP_AUTHORIZATION_FAILED: | ||
| log.error("Received authorization failure for group {} in `OffsetCommit` response", groupId, | ||
| error.exception()); | ||
| log.error("Received authorization failure for group {} in `{}` response", groupId, | ||
| apiName(), error.exception()); | ||
| failed.put(groupId, error.exception()); | ||
| break; | ||
| case COORDINATOR_LOAD_IN_PROGRESS: | ||
| // If the coordinator is in the middle of loading, then we just need to retry | ||
| log.debug("`{}` request for group {} failed because the coordinator" + | ||
| " is still in the process of loading state. Will retry.", apiName(), groupId); | ||
| break; | ||
| case COORDINATOR_NOT_AVAILABLE: | ||
| case NOT_COORDINATOR: | ||
| log.debug("OffsetCommit request for group {} returned error {}. Will retry", groupId, error); | ||
| // If the coordinator is unavailable or there was a coordinator change, then we unmap | ||
| // the key so that we retry the `FindCoordinator` request | ||
| log.debug("`{}` request for group {} returned error {}. " + | ||
| "Will attempt to find the coordinator again and retry.", apiName(), groupId, error); | ||
| unmapped.add(groupId); | ||
| break; | ||
| default: | ||
| log.error("Received unexpected error for group {} in `OffsetCommit` response", | ||
| groupId, error.exception()); | ||
| failed.put(groupId, error.exception( | ||
| "Received unexpected error for group " + groupId + " in `OffsetCommit` response")); | ||
| final String unexpectedErrorMsg = String.format("Received unexpected error for group %s in `%s` response", | ||
| groupId, apiName()); | ||
| log.error(unexpectedErrorMsg, error.exception()); | ||
| failed.put(groupId, error.exception(unexpectedErrorMsg)); | ||
|
Member
The error handling bugs me a bit. It seems to me that we should either differentiate the group-level errors from the partition-level errors here, or consider all of them partition-level errors. What do you think? Also, I think that we should handle all the expected errors here. The default error message here is wrong. There are many errors which we expect but which are not handled.
Member
Author
@dajac, thanks for your comment.
I agree with you. I think KIP-699 was trying not to break existing tests. It does need improvement.
You're right. I'm thinking we can handle them in a separate PR and open a Jira ticket to track it. And since V3.0 is released, maybe that improvement can go into the next release. What do you think?
Member
@showuon I actually believe that we have regressed here. The admin API returns wrong results. I just tried with a small unit test: it works with 2.8 but fails with trunk. In trunk,
Member
I have opened a PR for this here: #11016. |
||
| } | ||
| } | ||
|
|
||
|
|
||
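
The completion rule added to `handleResponse` in the diff above (no failure, no unmapped key, and every partition accounted for) can be sketched in isolation. This is a simplified illustration, not the Kafka implementation: the class `CommitCompletionSketch`, the `Err` enum, and the use of plain strings for groups and partitions are all assumptions made so the example is self-contained.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the handler; none of these names come from Kafka.
class CommitCompletionSketch {
    // Simplified error codes, assumed for illustration only.
    enum Err { NONE, COORDINATOR_LOAD_IN_PROGRESS, NOT_COORDINATOR, GROUP_AUTHORIZATION_FAILED }

    static final class Result {
        final Map<String, Map<String, Err>> completed = new HashMap<>();
        final Map<String, String> failed = new HashMap<>();
        final Set<String> unmapped = new HashSet<>();
    }

    static Result handle(String groupId, Map<String, Err> partitionErrors) {
        Result result = new Result();
        Map<String, Err> partitions = new HashMap<>();
        int totalPartitionCount = 0;
        for (Map.Entry<String, Err> entry : partitionErrors.entrySet()) {
            Err error = entry.getValue();
            switch (error) {
                case NONE:
                    partitions.put(entry.getKey(), error);
                    break;
                case COORDINATOR_LOAD_IN_PROGRESS:
                    // Retriable: record nothing, so the partition stays unhandled.
                    break;
                case NOT_COORDINATOR:
                    // Coordinator changed: unmap the key so it is looked up again.
                    result.unmapped.add(groupId);
                    break;
                default:
                    result.failed.put(groupId, error.name());
            }
            totalPartitionCount++;
        }
        // Complete only when nothing failed, nothing was unmapped,
        // and every partition was handled (i.e. no retry is needed).
        if (result.failed.isEmpty() && result.unmapped.isEmpty()
                && partitions.size() == totalPartitionCount) {
            result.completed.put(groupId, partitions);
        }
        return result;
    }
}
```

With this shape, a response where one partition reports `COORDINATOR_LOAD_IN_PROGRESS` leaves all three collections empty, which is what signals the caller to send the request again.
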
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -104,15 +104,18 @@ public ApiResult<CoordinatorKey, Map<TopicPartition, Errors>> handleResponse( | |
|
|
||
| final Errors error = Errors.forCode(response.data().errorCode()); | ||
| if (error != Errors.NONE) { | ||
| handleError(groupId, error, failed, unmapped); | ||
| handleGroupLevelError(groupId, error, failed, unmapped); | ||
| } else { | ||
| final Map<TopicPartition, Errors> partitions = new HashMap<>(); | ||
| response.data().topics().forEach(topic -> | ||
| topic.partitions().forEach(partition -> { | ||
| Errors partitionError = Errors.forCode(partition.errorCode()); | ||
| if (!handleError(groupId, partitionError, failed, unmapped)) { | ||
| partitions.put(new TopicPartition(topic.name(), partition.partitionIndex()), partitionError); | ||
| topic.partitions().forEach(partitionOffsetDeleteResponse -> { | ||
| Errors partitionError = Errors.forCode(partitionOffsetDeleteResponse.errorCode()); | ||
| TopicPartition tp = new TopicPartition(topic.name(), partitionOffsetDeleteResponse.partitionIndex()); | ||
| if (log.isDebugEnabled() && partitionError != Errors.NONE) { | ||
| log.debug("`{}` request for group {} returned error {} in the partition {}.", | ||
| apiName(), groupId, partitionError, tp); | ||
| } | ||
| partitions.put(tp, partitionError); | ||
|
Member
Author
We don't handle partition-level errors; we just put them into the completion result. |
||
| }) | ||
| ); | ||
| if (!partitions.isEmpty()) | ||
|
|
@@ -121,7 +124,7 @@ public ApiResult<CoordinatorKey, Map<TopicPartition, Errors>> handleResponse( | |
| return new ApiResult<>(completed, failed, unmapped); | ||
| } | ||
|
|
||
| private boolean handleError( | ||
| private void handleGroupLevelError( | ||
| CoordinatorKey groupId, | ||
| Errors error, | ||
| Map<CoordinatorKey, Throwable> failed, | ||
|
|
@@ -131,20 +134,29 @@ private boolean handleError( | |
| case GROUP_AUTHORIZATION_FAILED: | ||
| case GROUP_ID_NOT_FOUND: | ||
| case INVALID_GROUP_ID: | ||
| log.error("Received non retriable error for group {} in `DeleteConsumerGroupOffsets` response", groupId, | ||
| error.exception()); | ||
| case NON_EMPTY_GROUP: | ||
| log.error("Received non retriable error for group {} in `{}` response", groupId, | ||
| apiName(), error.exception()); | ||
| failed.put(groupId, error.exception()); | ||
| return true; | ||
| break; | ||
| case COORDINATOR_LOAD_IN_PROGRESS: | ||
| // If the coordinator is in the middle of loading, then we just need to retry | ||
| log.debug("`{}` request for group {} failed because the coordinator" + | ||
| " is still in the process of loading state. Will retry.", apiName(), groupId); | ||
| break; | ||
| case COORDINATOR_NOT_AVAILABLE: | ||
| return true; | ||
| case NOT_COORDINATOR: | ||
| log.debug("DeleteConsumerGroupOffsets request for group {} returned error {}. Will retry", | ||
| groupId, error); | ||
| // If the coordinator is unavailable or there was a coordinator change, then we unmap | ||
| // the key so that we retry the `FindCoordinator` request | ||
| log.debug("`{}` request for group {} returned error {}. " + | ||
| "Will attempt to find the coordinator again and retry.", apiName(), groupId, error); | ||
| unmapped.add(groupId); | ||
| return true; | ||
| break; | ||
| default: | ||
| return false; | ||
| final String unexpectedErrorMsg = String.format("Received unexpected error for group %s in `%s` response", | ||
| groupId, apiName()); | ||
| log.error(unexpectedErrorMsg, error.exception()); | ||
| failed.put(groupId, error.exception()); | ||
| } | ||
| } | ||
|
|
||
|
|
||
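
The split between group-level and partition-level errors in the diff above can be illustrated with a small sketch. It assumes a hypothetical class `OffsetDeleteResultSketch` and a reduced `Err` enum; the real handler works with Kafka's `Errors`, `CoordinatorKey`, and `TopicPartition` types, which are replaced by plain strings here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration; the names below are not part of Kafka.
class OffsetDeleteResultSketch {
    // Reduced set of error codes, assumed for the example.
    enum Err { NONE, GROUP_ID_NOT_FOUND, GROUP_SUBSCRIBED_TO_TOPIC, UNKNOWN_TOPIC_OR_PARTITION }

    /**
     * Returns the per-partition results, or an empty Optional when a
     * group-level error means the partition results are not inspected.
     */
    static Optional<Map<String, Err>> partitionResults(Err groupError,
                                                       Map<String, Err> partitionErrors) {
        if (groupError != Err.NONE) {
            // Group-level errors are handled separately (fail, retry, or unmap).
            return Optional.empty();
        }
        Map<String, Err> partitions = new HashMap<>();
        for (Map.Entry<String, Err> entry : partitionErrors.entrySet()) {
            // Partition-level errors are not interpreted here; they are
            // reported to the caller as part of the completed result.
            partitions.put(entry.getKey(), entry.getValue());
        }
        return Optional.of(partitions);
    }
}
```

The design choice this reflects is that a partition-level error does not fail the whole group: the caller receives a map in which some partitions carry an error and others carry `NONE`.
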
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -112,21 +112,34 @@ private void handleError( | |
| ) { | ||
| switch (error) { | ||
| case GROUP_AUTHORIZATION_FAILED: | ||
| log.error("Received authorization failure for group {} in `DeleteConsumerGroups` response", groupId, | ||
| error.exception()); | ||
| log.error("Received authorization failure for group {} in `{}` response", groupId, | ||
| apiName(), error.exception()); | ||
| failed.put(groupId, error.exception()); | ||
| break; | ||
| case INVALID_GROUP_ID: | ||
| case NON_EMPTY_GROUP: | ||
| case GROUP_ID_NOT_FOUND: | ||
|
Member
Author
Handle all possible errors. |
||
| log.error("Received non retriable failure for group {} in `{}` response", groupId, | ||
| apiName(), error.exception()); | ||
| failed.put(groupId, error.exception()); | ||
| break; | ||
| case COORDINATOR_LOAD_IN_PROGRESS: | ||
| case COORDINATOR_NOT_AVAILABLE: | ||
| // If the coordinator is in the middle of loading, then we just need to retry | ||
| log.debug("`{}` request for group {} failed because the coordinator " + | ||
| "is still in the process of loading state. Will retry", apiName(), groupId); | ||
| break; | ||
| case COORDINATOR_NOT_AVAILABLE: | ||
| case NOT_COORDINATOR: | ||
| log.debug("DeleteConsumerGroups request for group {} returned error {}. Will retry", | ||
| groupId, error); | ||
| // If the coordinator is unavailable or there was a coordinator change, then we unmap | ||
| // the key so that we retry the `FindCoordinator` request | ||
| log.debug("`{}` request for group {} returned error {}. " + | ||
| "Will attempt to find the coordinator again and retry", apiName(), groupId, error); | ||
| unmapped.add(groupId); | ||
| break; | ||
| default: | ||
| log.error("Received unexpected error for group {} in `DeleteConsumerGroups` response", | ||
| groupId, error.exception()); | ||
| final String unexpectedErrorMsg = String.format("Received unexpected error for group %s in `%s` response", | ||
| groupId, apiName()); | ||
| log.error(unexpectedErrorMsg, error.exception()); | ||
| failed.put(groupId, error.exception()); | ||
| } | ||
| } | ||
|
|
||
As I understand it, we also retry if we get a `COORDINATOR_NOT_AVAILABLE` exception. Can we add a `DEBUG` level log statement stating this and that we will retry? It would be good to mimic for `COORDINATOR_NOT_AVAILABLE` errors what we do for the other retriable errors.

Yes, we'll retry on a `COORDINATOR_NOT_AVAILABLE` error. And we already log it below:
We use a general approach (with the `error` variable) to log when either a `COORDINATOR_NOT_AVAILABLE` or a `NOT_COORDINATOR` error happens.
I think it should be fine unless you have another suggestion. Thank you. :)
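
The point made in this exchange, that one log statement covers both coordinator errors because the error is passed as a parameter, can be sketched as follows. The class `CoordinatorErrorSketch`, the `Action` enum, and the helper `debugMessage` are hypothetical names introduced for illustration; only the error names and the message wording are taken from the diff.

```java
// Hypothetical illustration; these types are not part of Kafka.
class CoordinatorErrorSketch {
    enum Err {
        GROUP_AUTHORIZATION_FAILED,
        COORDINATOR_LOAD_IN_PROGRESS,
        COORDINATOR_NOT_AVAILABLE,
        NOT_COORDINATOR,
        UNKNOWN_SERVER_ERROR
    }

    enum Action { FAIL, RETRY, UNMAP_AND_RETRY }

    static Action classify(Err error) {
        switch (error) {
            case COORDINATOR_LOAD_IN_PROGRESS:
                // The coordinator is still loading state: retry against the same node.
                return Action.RETRY;
            case COORDINATOR_NOT_AVAILABLE:
            case NOT_COORDINATOR:
                // Both cases fall through to the same handling: unmap the key
                // so the coordinator is looked up again before retrying.
                return Action.UNMAP_AND_RETRY;
            default:
                return Action.FAIL;
        }
    }

    // A single message template serves both fall-through cases because
    // the concrete error is supplied as an argument.
    static String debugMessage(String apiName, String groupId, Err error) {
        return String.format(
            "`%s` request for group %s returned error %s. "
                + "Will attempt to find the coordinator again and retry.",
            apiName, groupId, error);
    }
}
```
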