KAFKA-5950: AdminClient should retry based on returned error codes#4167

Closed
adyach wants to merge 2 commits into apache:trunk from adyach:KAFKA-5950

Conversation

Contributor

@adyach adyach commented Nov 1, 2017

This PR adds retries for KafkaAdminClient's retriable error responses.

@adyach adyach changed the title KAFKA-5950: retry on response retriable exception KAFKA-5950: AdminClient should retry based on returned error codes Nov 1, 2017

asfgit commented Nov 1, 2017

FAILURE
No test results found.

2 similar comments

Member

ijuma commented Nov 1, 2017

cc @cmccabe


asfgit commented Nov 3, 2017

FAILURE
No test results found.


asfgit commented Nov 3, 2017

FAILURE
7942 tests run, 5 skipped, 4 failed.

1 similar comment

Contributor

cmccabe commented Nov 6, 2017

Thanks for this contribution, @adyach.

As you probably guessed, the reason for the original design here is that Kafka actually has no "generic" error response messages that can be deserialized without knowledge of the RPC type. So every RPC error has to be handled by RPC-type-specific code, even if the handling is something that should be common to all RPCs.

So the original intention was that all errors should be handled by each Call subclass, in the Call#handleResponse method. However, it seems like the Call#handleResponse methods are indeed not handling a lot of retriable exceptions in practice, and that's something we need to fix.

It might seem obvious that the solution to receiving a retriable exception is always to retry, as this patch does. But unfortunately, that is not quite true. I would divide up the RetriableExceptions like so:

  1. RETRY AFTER DELAY: CoordinatorLoadInProgressException, GroupLoadInProgressException, CoordinatorNotAvailableException, RetriableCommitFailedException, NoAvailableBrokersException, LeaderNotAvailableException, NetworkException (not sure if broker ever returns this in a response)

  2. REFRESH METADATA AND RERUN NODEPROVIDER: NotControllerException, NotCoordinatorForGroupException, NotCoordinatorException, NotLeaderForPartitionException, StaleMetadataException, LeaderNotAvailableException

  3. RETURN ERROR IMMEDIATELY: CorruptRecordException, KafkaStorageException, TimeoutException (when returned by the broker in a response), NotEnoughReplicasException, NotEnoughReplicasAfterAppendException

Retrying the exceptions in group number 2 without refreshing the metadata will not help. The exceptions in group 3 should not be retried automatically.
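The three-way split described above could be sketched like this. This is a hypothetical helper, not the actual KafkaAdminClient code: the `RetryPolicy` class, `Action` enum, and the short error-name strings standing in for the exception classes are all illustrative assumptions.

```java
import java.util.Set;

// Hypothetical sketch of the three-way split; a real AdminClient would
// dispatch on the exception class or the Errors enum, not on strings.
public class RetryPolicy {
    public enum Action { RETRY_AFTER_DELAY, REFRESH_METADATA_AND_RETRY, FAIL_IMMEDIATELY }

    // Group 1: transient broker-side conditions; retry after a backoff.
    private static final Set<String> RETRY_AFTER_DELAY = Set.of(
        "CoordinatorLoadInProgress", "CoordinatorNotAvailable",
        "RetriableCommitFailed", "NoAvailableBrokers", "Network");

    // Group 2: we were talking to the wrong node; refresh metadata and
    // re-run the NodeProvider before retrying.
    private static final Set<String> REFRESH_METADATA = Set.of(
        "NotController", "NotCoordinator", "NotLeaderForPartition", "StaleMetadata");

    public static Action classify(String error) {
        if (RETRY_AFTER_DELAY.contains(error))
            return Action.RETRY_AFTER_DELAY;
        if (REFRESH_METADATA.contains(error))
            return Action.REFRESH_METADATA_AND_RETRY;
        // Group 3: errors like CorruptRecord or KafkaStorage, where an
        // automatic retry cannot help; fail the call immediately.
        return Action.FAIL_IMMEDIATELY;
    }
}
```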

Member

ijuma commented Nov 8, 2017

@cmccabe The current design is that all retriable exceptions should be in categories 1 or 2. I'm a bit unclear on why you have some of the exceptions in 3. For example, NotEnoughReplicasException is usually a transient state and hence why we normally want to retry (until a timeout).

Contributor

cmccabe commented Nov 10, 2017

The current design is that all retriable exceptions should be in categories 1 or 2. I'm a bit unclear on why you have some of the exceptions in 3.

Here's one example that I think we can agree on: if someone invokes describeTopic on a topic we don't know about, we should return UnknownTopicOrPartitionException immediately, even though UTPE is a subclass of RetriableException. Otherwise, anytime someone wants to describe a topic that doesn't exist, they have to wait for the full AdminClient timeout to kick in before learning that it doesn't exist. This would be bad for a lot of existing users who rely on polling loops to wait for topics to be created.

For example, NotEnoughReplicasException is usually a transient state and hence why we normally want to retry (until a timeout).

Well, the Producer should certainly retry when it gets this exception. I don't think AdminClient can currently receive this exception, so it was a bad example on my part. Similarly, I don't think AdminClient can receive CorruptRecordException or KafkaStorageException currently, either. I just included them in category 3 because I can't think of any reason why retrying would help (we have no process which fixes on-disk corruption right now).
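The describeTopic point above can be illustrated with a hedged sketch: a deadline-bounded retry loop where a per-call override set marks an otherwise-retriable error as fail-fast. All names here (`FailFastRetry`, `CallError`, the error strings) are hypothetical, not Kafka APIs.

```java
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch: retry a call until a deadline, except for errors the
// caller declares fail-fast (e.g. UnknownTopicOrPartition for describeTopic),
// which are surfaced immediately even though their class is retriable.
public class FailFastRetry {
    public static class CallError extends RuntimeException {
        public final String code;
        public CallError(String code) { super(code); this.code = code; }
    }

    public static <T> T callWithRetries(Supplier<T> call, Set<String> failFast,
                                        long deadlineMs) {
        while (true) {
            try {
                return call.get();
            } catch (CallError e) {
                if (failFast.contains(e.code) || System.currentTimeMillis() >= deadlineMs)
                    throw e;  // give the caller the answer right away
                // otherwise fall through and retry (a real client would back off here)
            }
        }
    }
}
```

With this shape, describeTopic would pass UnknownTopicOrPartition in its fail-fast set, while a coordinator lookup would leave the set empty and retry until the deadline.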

Contributor

cmccabe commented Dec 5, 2017

Unfortunately, I don't think the approach in the current patch is going to work. The problem is that most of our RPCs are batch requests, and each element in the batch can fail or succeed independently.

For example, let's say you have a DeleteRecordsRequest that contains a batch of deletes. You send it to a broker. It turns out that 9 of the 10 deletes can be processed by that broker, but the final one gets a NotLeaderForPartitionException. In this scenario, the current patch will retry the whole request, which is clearly incorrect.

If AbstractResponse had a top-level error code, we could look at that and see if it was retriable. That would help for scenarios like NotControllerException. If one part of the request can't be processed because of NotControllerException, almost certainly the whole request can't be processed. Unfortunately, AbstractResponse does not have a top-level error code. Looking at AbstractResponse#errorCounts is not a reasonable substitute, because errorCounts does not tell you if there are any parts of the request that succeeded.
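A hedged sketch of the per-element problem: given a batch response mapping each partition to an error (null meaning success), only the elements that failed with a retriable error belong in the follow-up request; everything else is already final. `BatchRetrySplitter` and the string error names are illustrative assumptions, not the actual request classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: split a batch response into final results and the
// subset of elements worth retrying. Retrying the whole original batch
// would re-run the 9 deletes that already succeeded.
public class BatchRetrySplitter {
    public static List<String> retriableSubset(Map<String, String> perPartitionError) {
        List<String> retry = new ArrayList<>();
        for (Map.Entry<String, String> e : perPartitionError.entrySet()) {
            String err = e.getValue();
            if (err == null)
                continue;               // succeeded: result is final
            if (err.equals("NotLeaderForPartition") || err.equals("NotController"))
                retry.add(e.getKey());  // retriable: include in the follow-up request
            // non-retriable errors are also final; surface them to the caller
        }
        return retry;
    }
}
```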

Contributor

cmccabe commented Dec 6, 2017

See #4295 for another approach, which I think will work better.

@hachikuji
Contributor

Closing this since the issue was resolved in a separate patch.

@hachikuji hachikuji closed this May 11, 2019