KAFKA-13059: Make DeleteConsumerGroupOffsetsHandler unmap for COORDINATOR_NOT_AVAILABLE error and fix issue #11019
put every error into partitionResults, as the log logic did
Refer to #11016, we don't return any completed/failed results if we need to retry.
| public void testDeleteConsumerGroupOffsets() throws Exception { | ||
| // Happy path | ||
| public void testDeleteConsumerGroupOffsetsResponseIncludeCoordinatorErrorAndNoneError() throws Exception { |
Added a test in which the partition responses contain a coordinator error alongside NONE errors. We should retry in that case, too.
@dajac @rajinisivaram @mimaison, please take a look. Thanks.
| partitions.put(new TopicPartition(topic.name(), partition.partitionIndex()), partitionError); | ||
| final Map<TopicPartition, Errors> partitionResults = new HashMap<>(); | ||
| response.data().topics().forEach(topic -> | ||
| topic.partitions().forEach(partitionoffsetDeleteResponse -> { |
nit: Should we keep partition instead of partitionoffsetDeleteResponse? It is a bit more concise.
| Errors partitionError = Errors.forCode(partitionoffsetDeleteResponse.errorCode()); | ||
| TopicPartition topicPartition = new TopicPartition(topic.name(), partitionoffsetDeleteResponse.partitionIndex()); | ||
| if (partitionError != Errors.NONE) { | ||
| handlePartitionError(groupId, partitionError, topicPartition, groupsToUnmap, groupsToRetry); |
I am actually not sure about this. Looking at the code on the broker side, it seems that group errors are always returned in the top level error field. I think that we could simply return the partition errors without checking them.
Yes, I originally did it the way you suggested, but a test failed because of that change: testDeleteConsumerGroupOffsetsNumRetries in KafkaAdminClientTest. It puts NOT_COORDINATOR in the partition error and expects a retry. That's why I changed it to this.
What do you think?
I see. I think that it used to work because ConsumerGroupOperationContext.hasCoordinatorMoved relied on response.errorCount(). I think that the unit test is incorrect in this case.
| log.error("Received non retriable error for group {} in `{}` response", groupId, | ||
| apiName(), error.exception()); |
Could we make the error messages uniform? For instance: `OffsetDelete` request for group id {} failed due to error {}. I would also log it at debug level, and we don't need to pass the exception to the logger; it doesn't add much here.
| groupsToUnmap.add(groupId); | ||
| break; | ||
| default: | ||
| final String unexpectedErrorMsg = String.format("Received unexpected error for group %s in `%s` response", |
unexpectedErrorMsg is not necessary, as it is used only once. I would also follow the same pattern that we use for the other messages.
| case COORDINATOR_LOAD_IN_PROGRESS: | ||
| // If the coordinator is in the middle of loading, then we just need to retry | ||
| log.debug("`{}` request for group {} failed because the coordinator" + | ||
| " is still in the process of loading state. Will retry.", apiName(), groupId); |
I am not a fan of using apiName() here because the name offsetDelete does not start with a capital letter.
| Map<CoordinatorKey, Map<TopicPartition, Errors>> completed = new HashMap<>(); | ||
| Map<CoordinatorKey, Throwable> failed = new HashMap<>(); | ||
| List<CoordinatorKey> unmapped = new ArrayList<>(); | ||
| final Set<CoordinatorKey> groupsToUnmap = new HashSet<>(); |
Not related to this line. Is it worth verifying that groupIds only contains the expected groupId here and in buildRequest? I did it here: https://github.com/apache/kafka/pull/11016/files#diff-72f508d8e6b9b7f8fde5de8b75bedb6e7985824b71d00fb172338ec9c4782651R121.
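A minimal sketch of the kind of check suggested here, assuming the handler is created for a single group id. The class name `GroupIdValidationSketch`, the method name `validateKeys`, and the use of plain `String` keys instead of `CoordinatorKey` are illustrative assumptions, not the code from either PR.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch: String stands in for CoordinatorKey, and the names are illustrative.
final class GroupIdValidationSketch {
    private final String groupId;

    GroupIdValidationSketch(String groupId) {
        this.groupId = groupId;
    }

    // Rejects any key set other than exactly { groupId }.
    void validateKeys(Set<String> groupIds) {
        if (!groupIds.equals(Collections.singleton(groupId))) {
            throw new IllegalArgumentException("Received unexpected group ids " + groupIds +
                " (expected only " + Collections.singleton(groupId) + ")");
        }
    }
}
```

Calling such a check at the start of both the request-building and the response-handling paths would surface a mismatched key set early instead of silently processing it.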
| final Errors error = Errors.forCode(response.data().errorCode()); | ||
| if (error != Errors.NONE) { | ||
| handleError(groupId, error, failed, unmapped); | ||
| handleGroupError(groupId, error, failed, groupsToUnmap, groupsToRetry); |
It seems that groupsToRetry is not really necessary in this case. Moreover, we could directly return in the branch as we don't expect errors in the partitions.
if (error != Errors.NONE) {
    final Map<CoordinatorKey, Throwable> failed = new HashMap<>();
    final Set<CoordinatorKey> groupsToUnmap = new HashSet<>();
    handleGroupError(groupId, error, failed, groupsToUnmap);
    return new ApiResult<>(Collections.emptyMap(), failed, new ArrayList<>(groupsToUnmap));
}
groupId will be either in failed or in groupsToUnmap after the call to handleGroupError.
good suggestion! Updated!
| if (!partitions.isEmpty()) | ||
| completed.put(groupId, partitions); | ||
| completed.put(groupId, partitionResults); |
Could we directly return here as well?
return new ApiResult<>(Collections.singletonMap(groupId, partitionResults), Collections.emptyMap(), Collections.emptyList());
I think that it will make the error handling a bit more explicit.
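Putting the two early-return suggestions together, the response handling could be sketched roughly as follows. This is a self-contained simplification under stated assumptions: `SketchError` stands in for `Errors`, `SketchResult` for `ApiResult`, and plain strings for `CoordinatorKey` and `TopicPartition`. None of these stand-in names come from the PR, and which errors land in which bucket follows the cases discussed in this review.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the Kafka Errors enum (only a few values).
enum SketchError {
    NONE, NOT_COORDINATOR, COORDINATOR_NOT_AVAILABLE,
    COORDINATOR_LOAD_IN_PROGRESS, GROUP_ID_NOT_FOUND, GROUP_SUBSCRIBED_TO_TOPIC
}

// Hypothetical stand-in for ApiResult: completed, failed and unmapped keys.
final class SketchResult {
    final Map<String, Map<String, SketchError>> completed;
    final Map<String, SketchError> failed;
    final List<String> unmapped;

    SketchResult(Map<String, Map<String, SketchError>> completed,
                 Map<String, SketchError> failed,
                 List<String> unmapped) {
        this.completed = completed;
        this.failed = failed;
        this.unmapped = unmapped;
    }
}

final class OffsetDeleteHandlerSketch {
    static SketchResult handleResponse(String groupId, SketchError groupError,
                                       Map<String, SketchError> partitionErrors) {
        // A group-level error arrives in the top-level field: return right away.
        if (groupError != SketchError.NONE) {
            Map<String, SketchError> failed = new HashMap<>();
            Set<String> groupsToUnmap = new HashSet<>();
            handleGroupError(groupId, groupError, failed, groupsToUnmap);
            return new SketchResult(Collections.emptyMap(), failed, new ArrayList<>(groupsToUnmap));
        }
        // No group-level error: partition errors are reported as-is, without inspection.
        Map<String, SketchError> partitionResults = new HashMap<>(partitionErrors);
        return new SketchResult(Collections.singletonMap(groupId, partitionResults),
            Collections.emptyMap(), Collections.emptyList());
    }

    static void handleGroupError(String groupId, SketchError error,
                                 Map<String, SketchError> failed, Set<String> groupsToUnmap) {
        switch (error) {
            case COORDINATOR_NOT_AVAILABLE:
            case NOT_COORDINATOR:
                // Unmap the key so that the coordinator is looked up again.
                groupsToUnmap.add(groupId);
                break;
            case COORDINATOR_LOAD_IN_PROGRESS:
                // Plain retry: the key is neither failed nor unmapped.
                break;
            default:
                failed.put(groupId, error);
        }
    }
}
```

In this sketch, a response whose top-level error is `COORDINATOR_LOAD_IN_PROGRESS` yields an empty result in all three collections, which is how the "just retry" case shows up.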
| log.error("Received non retriable error for group {} in `DeleteConsumerGroupOffsets` response", groupId, | ||
| error.exception()); | ||
| case NON_EMPTY_GROUP: | ||
| log.debug("`OffsetDelete` request for group id {} failed due to error {}.", groupId, error); |
nit: groupId -> groupId.idValue. There are a few other cases.
Nice catch! I'll also update other PRs.
| break; | ||
| case COORDINATOR_LOAD_IN_PROGRESS: | ||
| // If the coordinator is in the middle of loading, then we just need to retry | ||
| log.debug("`OffsetDelete` request for group {} failed because the coordinator" + |
Updated. I'll also update other PRs.
| return true; | ||
| // If the coordinator is unavailable or there was a coordinator change, then we unmap | ||
| // the key so that we retry the `FindCoordinator` request | ||
| log.debug("`OffsetDelete` request for group {} returned error {}. " + |
| new OffsetDeleteResponseData() | ||
| .setTopics(new OffsetDeleteResponseTopicCollection(Stream.of( | ||
| new OffsetDeleteResponseTopic() | ||
| .setName("foo") | ||
| .setPartitions(new OffsetDeleteResponsePartitionCollection(Collections.singletonList( | ||
| new OffsetDeleteResponsePartition() | ||
| .setPartitionIndex(0) | ||
| .setErrorCode(Errors.NONE.code()) | ||
| ).iterator())), | ||
| new OffsetDeleteResponseTopic() | ||
| .setName("bar") | ||
| .setPartitions(new OffsetDeleteResponsePartitionCollection(Collections.singletonList( | ||
| new OffsetDeleteResponsePartition() | ||
| .setPartitionIndex(0) | ||
| .setErrorCode(Errors.GROUP_SUBSCRIBED_TO_TOPIC.code()) | ||
| ).iterator())) | ||
| ).collect(Collectors.toList()).iterator())) |
nit: Is it really better like this? Personally, I prefer the previous indentation.
Sorry, that change was accidental.
| .setThrottleTimeMs(0) | ||
| .setTopics(new OffsetDeleteResponseTopicCollection(singletonList( | ||
| new OffsetDeleteResponseTopic() | ||
| .setName("t0") |
nit: Could we rely on t0p0 here for the name and the partition?
| .setThrottleTimeMs(0) | ||
| .setTopics(new OffsetDeleteResponseTopicCollection(singletonList( | ||
| new OffsetDeleteResponseTopic() | ||
| .setName("t0") |
| Collection<Map<TopicPartition, Errors>> completeCollection = result.completedKeys.values(); | ||
| assertEquals(1, completeCollection.size()); | ||
| Map<TopicPartition, Errors> completeMap = completeCollection.iterator().next(); | ||
| assertEquals(expectedResult, completeMap); |
You already assert that completedKeys only contains key so it seems that we could just verify that result.completedKeys.get(key) is equal to expectedResult, no?
Good suggestion! Updated.
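The simplified check can be sketched without any test framework as below. The helper name `assertOnlyCompleted` and the generic key/value types are assumptions for illustration; in the actual test this would presumably be two `assertEquals` calls.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical helper: verifies that `key` is the only completed key and that
// its value equals the expected result, without iterating over values().
final class CompletedKeysCheckSketch {
    static <K, V> void assertOnlyCompleted(Map<K, V> completedKeys, K key, V expectedResult) {
        if (!Collections.singleton(key).equals(completedKeys.keySet())) {
            throw new AssertionError("Unexpected completed keys: " + completedKeys.keySet());
        }
        if (!expectedResult.equals(completedKeys.get(key))) {
            throw new AssertionError("Unexpected result for " + key + ": " + completedKeys.get(key));
        }
    }
}
```

Looking the value up by key ties the assertion to the key that was already verified, rather than to whichever entry the iterator happens to return first.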
| assertEquals(emptyList(), result.unmappedKeys); | ||
| assertEquals(emptySet(), result.failedKeys.keySet()); | ||
| } | ||
| } No newline at end of file |
nit: Could we add the empty line back?
| new OffsetDeleteResponseData() | ||
| .setThrottleTimeMs(0) | ||
| .setTopics(new OffsetDeleteResponseTopicCollection(singletonList( | ||
| new OffsetDeleteResponseTopic() | ||
| .setName(t0p0.topic()) | ||
| .setPartitions(new OffsetDeleteResponsePartitionCollection(singletonList( | ||
| new OffsetDeleteResponsePartition() | ||
| .setPartitionIndex(t0p0.partition()) | ||
| .setErrorCode(error.code()) | ||
| ).iterator())) | ||
| ).iterator())) | ||
| ); |
Failures are not related:
…ATOR_NOT_AVAILABLE error (#11019) This patch improves the error handling in `DeleteConsumerGroupOffsetsHandler`. `COORDINATOR_NOT_AVAILABLE` is now unmapped to trigger a new find coordinator request to be sent out. Reviewers: David Jacot <djacot@confluent.io>
Merged to trunk and to 3.0. cc @kkonstantine
Some issues found in the DeleteConsumerGroupOffsetsHandler:
- When a coordinator error is put in a topic partition result alongside an Errors.NONE, we fail with IllegalArgumentException: Partition foo was not included in the original request. This is the scenario covered by the newly added test case testDeleteConsumerGroupOffsetsResponseIncludeCoordinatorErrorAndNoneError.
- In DeleteConsumerGroupOffsetsHandlerTest, we built all errors into the partition result, including group errors. The group error tests and the partition error tests are now split.
This is the old handle response logic, for your reference: