Skip to content

KAFKA-6299. Fix AdminClient error handling when metadata changes#4295

Merged
hachikuji merged 5 commits intoapache:trunkfrom
cmccabe:KAFKA-6299
May 9, 2018
Merged

KAFKA-6299. Fix AdminClient error handling when metadata changes#4295
hachikuji merged 5 commits intoapache:trunkfrom
cmccabe:KAFKA-6299

Conversation

@cmccabe
Copy link
Copy Markdown
Contributor

@cmccabe cmccabe commented Dec 6, 2017

AdminClient should only call Metadata#requestUpdate when needed.

When AdminClient gets a NOT_CONTROLLER error, it should refresh its metadata and retry the request, rather than making the end-user deal with NotControllerException.

Move AdminClient's metadata management outside of NetworkClient and into AdminMetadataManager. This will make it easier to do more sophisticated metadata management in the future, such as implementing a NodeProvider which fetches the leaders for topics.

Rather than manipulating newCalls directly, the AdminClient service thread now drains it directly into pendingCalls. This minimizes the amount of locking we have to do, since pendingCalls is only accessed from the service thread.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this if block be placed above the if block on line 729 ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm-- I don't think so. We don't want to return the exception of the stale metadata, if new metadata has been requested.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably add comment on why synchronization on callsToSend is not needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a comment to the declaration of callsToSend, since this is used in many other places.

@cmccabe cmccabe force-pushed the KAFKA-6299 branch 3 times, most recently from ccadd98 to 03be633 Compare December 15, 2017 19:33
@cmccabe
Copy link
Copy Markdown
Contributor Author

cmccabe commented Dec 18, 2017

Test failure was org.apache.kafka.common.security.authenticator.ClientAuthenticationFailureTest.testAdminClientWithInvalidCredentials, which is not related.

@hachikuji
Copy link
Copy Markdown
Contributor

@cmccabe The test failure does seem caused by this patch. I reproduced locally on this branch. Can you check again?

Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch. Left a few comments/questions.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of a weird contract. Maybe the name should be checkMetadataFetchError or something like that?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkMetadataError might be a better name

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? It seems like the only way it's possible for the if condition to be true is if we already updated lastSeenMetadata below.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of this if statement is that we shouldn't keep calling Metadata#requestUpdate over and over. We should call it once, wait for an update, and then call it again after that if needed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not questioning the need of the if, but the reassignment to lastSeenMetadataVersion, which seemed redundant given how it is updated below. I may have answered my own question however. The problem is that the version in Metadata is updated in two additional cases: 1) on initial bootstrapping, and 2) after a periodic refresh. We handle the first case I think because of this reassignment and the fact that the two fields are initialized to 0.

I'm not sure about the second case, however. Say, for example, that lowestValidMetadataVersion and lastSeenMetadataVersion are both 5 and no update has been requested. After the metadata max age expires, we'll automatically trigger an update and bump lastSeenMetadataVersion to 6. Now if we request a metadata update, nothing will happen because we'll never reach equality again. We may be able to fix it by changing to an inequality, but I'm not sure. Some testing would be helpful.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not that big of a deal, but since we do the same thing in several places, maybe we could add a requestMetadataUpdate method to the runnable.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the NOT_CONTROLLER error possible for Metadata requests? Also, are there any other errors we care about at this level (e.g. NOT_LEADER errors)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's not. Good catch.

"Not leader for partition" is something we care about for certain calls, but handling it will be harder. I want to do that in a follow-on change.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I would have expected that these transient errors would get retried internally. Is the intent to do this separately?

Copy link
Copy Markdown
Contributor Author

@cmccabe cmccabe Apr 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Unfortunately, they can't be retried internally because we have no generic top level error code. So if createTopics fails, you get back a response that has {foo : NOT_CONTROLLER, bar: NOT_CONTROLLER, baz: NOT_CONTROLLER, etc.} You can't parse this without knowing the subtype, which means it has to be done here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify my question, does the user request fail because of a transient NOT_CONTROLLER error or do we retry the request? I had expected we would retry, but perhaps we are leaving that for future work?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you verified that this works for all uses of version at the moment? It seems like a minor change in behavior since we use failedUpdate in DefaultMetadataUpdater if there are no nodes in the response. Currently this would cause uses such as ConsumerNetworkClient.awaitMetadataUpdate() to retry before returning.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think retrying is what we want, though. If you get an authentication exception from any broker, your auth is bad and you should fail. It's not a retryable exception, conceptually or in terms of code

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case I am referring to is in DefaultMetadataupdater.handleCompletedMetadataResponse, which is different from the authentication failure path. We have a check to ensure that the metadata response contains at least one broker. With this change, we will update the metadata version even in the case that it is empty, which will cause methods like ConsumerNetworkClient.awaitMetadataUpdate() to return earlier than they currently do.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Maybe I'm misinterpreting, but are you suggesting that ConsumerNetworkClient#awaitMetadataUpdate() should wait forever (or until the timeout hits) when there is an authentication error fetching the metadata? That doesn't seem right.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am definitely not suggesting that. I am not talking about authentication failures at all. I am referring to the call to Metadata.failedUpdate that is in DefaultMetadataupdater.handleCompletedMetadataResponse. With this change, the version will be incremented following an "empty" metadata update. This will cause us to exit the loop in ConsumerNetworkClient#awaitMetadataUpdate() even though we have not received a valid update.

My point more generally is that we are changing the meaning of the version field inside Metadata. It seems we were intentionally using this before to indicate when we had received a valid new version of the metadata. But now we bump it even when there's a failure which means we can no longer use it to tell when we've seen a successful update. As far as I can tell, the impact may be minor, but we should consider all of the usages to be sure of it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this was not really necessary since we just needed a sentinel value? Note that common/errors is a public package.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can use NetworkException instead (which is a subclass of InvalidMetadataException)

@cmccabe cmccabe force-pushed the KAFKA-6299 branch 2 times, most recently from 10ed8a5 to 66bbf66 Compare April 13, 2018 22:14
env.kafkaClient().setNode(env.cluster().nodeById(0));
env.kafkaClient().prepareResponse(new CreateTopicsResponse(Collections.singletonMap("myTopic", new ApiError(Errors.NONE, ""))));
KafkaFuture<Void> future = env.adminClient().createTopics(
Collections.singleton(new NewTopic("myTopic", Collections.singletonMap(Integer.valueOf(0), asList(new Integer[]{0, 1, 2})))),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this can be simplified Collections.singletonMap(0, asList(0, 1, 2)

A few more of these in this file.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not questioning the need of the if, but the reassignment to lastSeenMetadataVersion, which seemed redundant given how it is updated below. I may have answered my own question however. The problem is that the version in Metadata is updated in two additional cases: 1) on initial bootstrapping, and 2) after a periodic refresh. We handle the first case I think because of this reassignment and the fact that the two fields are initialized to 0.

I'm not sure about the second case, however. Say, for example, that lowestValidMetadataVersion and lastSeenMetadataVersion are both 5 and no update has been requested. After the metadata max age expires, we'll automatically trigger an update and bump lastSeenMetadataVersion to 6. Now if we request a metadata update, nothing will happen because we'll never reach equality again. We may be able to fix it by changing to an inequality, but I'm not sure. Some testing would be helpful.

@cmccabe
Copy link
Copy Markdown
Contributor Author

cmccabe commented Apr 23, 2018

Jenkins is flaking again due to out of memory errors launching git.

18:28:10 Caused by: java.lang.OutOfMemoryError: unable to create new native thread
18:28:10 	at java.lang.Thread.start0(Native Method)
18:28:10 	at java.lang.Thread.start(Thread.java:717)
18:28:10 	at hudson.Proc$LocalProc.<init>(Proc.java:269)
18:28:10 	at hudson.Proc$LocalProc.<init>(Proc.java:218)
18:28:10 	at hudson.Launcher$LocalLauncher.launch(Launcher.java:930)
18:28:10 	at hudson.Launcher$ProcStarter.start(Launcher.java:450)
18:28:10 	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1992)
18:28:10 	... 15 more
18:28:10 ERROR: Error cloning remote repo 'origin'
18:28:10 Retrying after 10 seconds

@cmccabe
Copy link
Copy Markdown
Contributor Author

cmccabe commented Apr 23, 2018

retest this please

@cmccabe cmccabe force-pushed the KAFKA-6299 branch 2 times, most recently from 530d59b to 41b3f7a Compare May 2, 2018 18:26
Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, the approach using the custom metadata updater seems promising. Left some comments/questions.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a public package, which conventionally has implied that all of the classes in it are public as well. Should we have an internals package for stuff like this? This is the pattern we use for the consumer and producer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll create an internals package, to follow the convention.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: comments like this seem like overkill

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to extend ManualMetadataUpdater. It already has most of the no-op functionality we want.

Copy link
Copy Markdown
Contributor Author

@cmccabe cmccabe May 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my thought process here is that the broker is using ManualMetadataUpdater, and I don't want changes to ManualMetadataUpdater to change AdminClient. Since the (lack of?) functionality here is minimal, seems better just to create another class.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use the other constructor?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little unclear if we need this. Below when we see a disconnect, we check for authentication errors explicitly. Do we need anything else? It would be a little clearer if we only have one path for surfacing authentication errors.

Copy link
Copy Markdown
Contributor Author

@cmccabe cmccabe May 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When NetworkClient gets a request, it sometimes has to make additional internal requests to fulfill it. For example, if NC gets asked to make an AlterConfigsRequest, it may first have to make an API versions request. However, if this API versions request fails, there is nothing which ties the failure back to the AlterConfigsRequest. Since it's an "internal" request, the failure disappears without a trace and never makes its way into the list of responses.

I would argue that this is a bug in NetworkClient. We are currently hacking around it by things like having the event loop manually iterate through each node in ClusterConnectionStates to see whether any of them ended up in ConnectionState.AUTHENTICATION_FAILED. The metadata updater is another hack which gets triggered even when the Response gets dropped, in NetworkClient#processDisconnection.

If we want to be a little braver, we could say that when handling an internal APIVersionRequest disconnect, we could also send a disconnection to the next non-internal request queued for that node.

Really the whole concept of an internal request is evil. We should just have a wrapper class around NetworkClient that translates one stream of requests into another. That would help us keep this straight. But that is too big to do right now.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not a concurrent modification?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, the iterator removes the entry from pendingAuthenticationErrors via Iterator#remove, and then the call to pendingAuthenticationErrors#remove has no effect.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name timeoutMs seems misleading. It's really the blackout period following the authentication error?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I will change it to blackoutMs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the expected behavior if the user ignores the auth error and continues to use the AdminClient?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will continue to throw AuthenticationException.

After enough time, another metadata request may be made which may succeed, which would allow future requests to go through. But we don't spam metadata requests or anything-- if the auth exception is cleared, it will be because of a timeout.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not saving much..

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the poll timeout take into account the time to the next metadata refresh? What happens if we are in the middle of the metadata backoff when a call is made?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should. Will fix.

@cmccabe
Copy link
Copy Markdown
Contributor Author

cmccabe commented May 9, 2018

retest this please

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the synchronization needed here since pendingCalls is only accessed by this thread?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not needed. Good catch.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need the second {} since the logging treats the exception specially.

cmccabe added 3 commits May 9, 2018 10:09
When AdminClient gets a NOT_CONTROLLER error, it should refresh its
metadata and retry the request, rather than making the end-user deal
with NotControllerException.

Move AdminClient's metadata management outside of NetworkClient and into
AdminMetadataManager.  This will make it easier to do more sophisticated
metadata management in the future, such as implementing a NodeProvider
which fetches the leaders for topics.

Rather than manipulating newCalls directly, the AdminClient service
thread now drains it directly into pendingCalls.  This minimizes the
amount of locking we have to do, since pendingCalls is only accessed
from the service thread.

public boolean isReady() {
if (authException != null) {
log.trace("Metadata is ready: got authentication exception.");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be "Metadata is not ready"?

Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch, LGTM. Will merge after the builds complete.

@hachikuji
Copy link
Copy Markdown
Contributor

retest this please

@hachikuji hachikuji merged commit abbd53d into apache:trunk May 9, 2018
ijuma added a commit to ijuma/kafka that referenced this pull request May 11, 2018
…-record-version

* apache-github/trunk:
  KAFKA-6894: Improve err msg when connecting processor with global store (apache#5000)
  KAFKA-6893; Create processors before starting acceptor in SocketServer (apache#4999)
  MINOR: Fix typo in ConsumerRebalanceListener JavaDoc (apache#4996)
  MINOR: Remove deprecated valueTransformer.punctuate (apache#4993)
  MINOR: Update dynamic broker configuration doc for truststore update (apache#4954)
  KAFKA-6870 Concurrency conflicts in SampledStat (apache#4985)
  KAFKA-6361: Fix log divergence between leader and follower after fast leader fail over (apache#4882)
  KAFKA-6813: Remove deprecated APIs in KIP-182, Part II (apache#4976)
  KAFKA-6878 Switch the order of underlying.init and initInternal (apache#4988)
  KAFKA-6299; Fix AdminClient error handling when metadata changes (apache#4295)
  KAFKA-6878: NPE when querying global state store not in READY state (apache#4978)
  KAFKA 6673: Implemented missing override equals method (apache#4745)
  KAFKA-6834: Handle compaction with batches bigger than max.message.bytes (apache#4953)
ying-zheng pushed a commit to ying-zheng/kafka that referenced this pull request Jul 6, 2018
…che#4295)

When AdminClient gets a NOT_CONTROLLER error, it should refresh its metadata and retry the request, rather than making the end-user deal with NotControllerException.

Move AdminClient's metadata management outside of NetworkClient and into AdminMetadataManager. This will make it easier to do more sophisticated metadata management in the future, such as implementing a NodeProvider which fetches the leaders for topics.

Rather than manipulating newCalls directly, the AdminClient service thread now drains it directly into pendingCalls. This minimizes the amount of locking we have to do, since pendingCalls is only accessed from the service thread.
@cmccabe cmccabe deleted the KAFKA-6299 branch May 20, 2019 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants