KAFKA-5950: AdminClient should retry based on returned error codes#4167

Closed
adyach wants to merge 2 commits into apache:trunk from adyach:KAFKA-5950

Conversation

Contributor

@adyach adyach commented Nov 1, 2017

This PR adds retries for KafkaAdminClient's retriable error responses.

@adyach adyach changed the title KAFKA-5950: retry on response retriable exception KAFKA-5950: AdminClient should retry based on returned error codes Nov 1, 2017

asfgit commented Nov 1, 2017

FAILURE
No test results found.

2 similar comments

Member

ijuma commented Nov 1, 2017

cc @cmccabe


asfgit commented Nov 3, 2017

FAILURE
No test results found.


asfgit commented Nov 3, 2017

FAILURE
7942 tests run, 5 skipped, 4 failed.

1 similar comment

Contributor

cmccabe commented Nov 6, 2017

Thanks for this contribution, @adyach.

As you probably guessed, the reason for the original design here is that Kafka actually has no "generic" error response messages that can be deserialized without knowledge of the RPC type. So every RPC error has to be handled by RPC-type-specific code, even if the handling is something that should be common to all RPCs.

So the original intention was that all errors should be handled by each Call subclass, in the Call#handleResponse method. However, it seems like the Call#handleResponse methods are indeed not handling a lot of retriable exceptions in practice, and that's something we need to fix.

It might seem obvious that the solution to receiving a retriable exception is always to retry, as this patch does. But unfortunately, that is not quite true. I would divide up the RetriableExceptions like so:

  1. RETRY AFTER DELAY: CoordinatorLoadInProgressException, GroupLoadInProgressException, CoordinatorNotAvailableException, RetriableCommitFailedException, NoAvailableBrokersException, LeaderNotAvailableException, NetworkException (not sure if broker ever returns this in a response)

  2. REFRESH METADATA AND RERUN NODEPROVIDER: NotControllerException, NotCoordinatorForGroupException, NotCoordinatorException, NotLeaderForPartitionException, StaleMetadataException, LeaderNotAvailableException

  3. RETURN ERROR IMMEDIATELY: CorruptRecordException, KafkaStorageException, TimeoutException (when returned by the broker in a response), NotEnoughReplicasException, NotEnoughReplicasAfterAppendException

Retrying the exceptions in group number 2 without refreshing the metadata will not help. The exceptions in group 3 should not be retried automatically.
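The three-way split described above could be sketched like this. This is a hypothetical helper, not the actual KafkaAdminClient code: the `RetryPolicy` class, `Action` enum, and the short error-name strings standing in for the exception classes are all illustrative assumptions.

```java
import java.util.Set;

// Hypothetical sketch of the three-way split; a real AdminClient would
// dispatch on the exception class or the Errors enum, not on strings.
public class RetryPolicy {
    public enum Action { RETRY_AFTER_DELAY, REFRESH_METADATA_AND_RETRY, FAIL_IMMEDIATELY }

    // Group 1: transient broker-side conditions; retry after a backoff.
    private static final Set<String> RETRY_AFTER_DELAY = Set.of(
        "CoordinatorLoadInProgress", "CoordinatorNotAvailable",
        "RetriableCommitFailed", "NoAvailableBrokers", "Network");

    // Group 2: we were talking to the wrong node; refresh metadata and
    // re-run the NodeProvider before retrying.
    private static final Set<String> REFRESH_METADATA = Set.of(
        "NotController", "NotCoordinator", "NotLeaderForPartition", "StaleMetadata");

    public static Action classify(String error) {
        if (RETRY_AFTER_DELAY.contains(error))
            return Action.RETRY_AFTER_DELAY;
        if (REFRESH_METADATA.contains(error))
            return Action.REFRESH_METADATA_AND_RETRY;
        // Group 3: errors like CorruptRecord or KafkaStorage, where an
        // automatic retry cannot help; fail the call immediately.
        return Action.FAIL_IMMEDIATELY;
    }
}
```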

Member

ijuma commented Nov 8, 2017

@cmccabe The current design is that all retriable exceptions should be in categories 1 or 2. I'm a bit unclear on why you have some of the exceptions in 3. For example, NotEnoughReplicasException is usually a transient state and hence why we normally want to retry (until a timeout).

Contributor

cmccabe commented Nov 10, 2017

The current design is that all retriable exceptions should be in categories 1 or 2. I'm a bit unclear on why you have some of the exceptions in 3.

Here's one example that I think we can agree on: if someone invokes describeTopic on a topic we don't know about, we should return UnknownTopicOrPartitionException immediately, even though UTPE is a subclass of RetriableException. Otherwise, anytime someone wants to describe a topic that doesn't exist, they have to wait for the full AdminClient timeout to kick in before learning that it doesn't exist. This would be bad for a lot of existing users who rely on polling loops to wait for topics to be created.

For example, NotEnoughReplicasException is usually a transient state and hence why we normally want to retry (until a timeout).

Well, the Producer should certainly retry when it gets this exception. I don't think AdminClient can currently receive this exception, so it was a bad example on my part. Similarly, I don't think AdminClient can receive CorruptRecordException or KafkaStorageException currently, either. I just included them in category 3 because I can't think of any reason why retrying would help (we have no process which fixes on-disk corruption right now).
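The describeTopic point above can be illustrated with a hedged sketch: a deadline-bounded retry loop where a per-call override set marks an otherwise-retriable error as fail-fast. All names here (`FailFastRetry`, `CallError`, the error strings) are hypothetical, not Kafka APIs.

```java
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch: retry a call until a deadline, except for errors the
// caller declares fail-fast (e.g. UnknownTopicOrPartition for describeTopic),
// which are surfaced immediately even though their class is retriable.
public class FailFastRetry {
    public static class CallError extends RuntimeException {
        public final String code;
        public CallError(String code) { super(code); this.code = code; }
    }

    public static <T> T callWithRetries(Supplier<T> call, Set<String> failFast,
                                        long deadlineMs) {
        while (true) {
            try {
                return call.get();
            } catch (CallError e) {
                if (failFast.contains(e.code) || System.currentTimeMillis() >= deadlineMs)
                    throw e;  // give the caller the answer right away
                // otherwise fall through and retry (a real client would back off here)
            }
        }
    }
}
```

With this shape, describeTopic would pass UnknownTopicOrPartition in its fail-fast set, while a coordinator lookup would leave the set empty and retry until the deadline.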

Contributor

cmccabe commented Dec 5, 2017

Unfortunately, I don't think the approach in the current patch is going to work. The problem is that most of our RPCs are batch requests, and each element in the batch can fail or succeed independently.

For example, let's say you have a DeleteRecordsRequest that contains a batch of deletes. You send it to a broker. It turns out that 9 of the 10 deletes can be processed by that broker, but the final one gets a NotLeaderForPartitionException. In this scenario, the current patch will retry the whole request, which is clearly incorrect.

If AbstractResponse had a top-level error code, we could look at that and see if it was retriable. That would help for scenarios like NotControllerException. If one part of the request can't be processed because of NotControllerException, almost certainly the whole request can't be processed. Unfortunately, AbstractResponse does not have a top-level error code. Looking at AbstractResponse#errorCounts is not a reasonable substitute, because errorCounts does not tell you if there are any parts of the request that succeeded.
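A hedged sketch of the per-element problem: given a batch response mapping each partition to an error (null meaning success), only the elements that failed with a retriable error belong in the follow-up request; everything else is already final. `BatchRetrySplitter` and the string error names are illustrative assumptions, not the actual request classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: split a batch response into final results and the
// subset of elements worth retrying. Retrying the whole original batch
// would re-run the 9 deletes that already succeeded.
public class BatchRetrySplitter {
    public static List<String> retriableSubset(Map<String, String> perPartitionError) {
        List<String> retry = new ArrayList<>();
        for (Map.Entry<String, String> e : perPartitionError.entrySet()) {
            String err = e.getValue();
            if (err == null)
                continue;               // succeeded: result is final
            if (err.equals("NotLeaderForPartition") || err.equals("NotController"))
                retry.add(e.getKey());  // retriable: include in the follow-up request
            // non-retriable errors are also final; surface them to the caller
        }
        return retry;
    }
}
```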

Contributor

cmccabe commented Dec 6, 2017

See #4295 for another approach, which I think will work better.

@hachikuji
Contributor

Closing this since the issue was resolved in a separate patch.

@hachikuji hachikuji closed this May 11, 2019