KAFKA-6897: Prevent producer from blocking indefinitely after close #5027

Merged
hachikuji merged 1 commit into apache:trunk from dhruvilshah3:producer-close on Jul 21, 2018

Conversation

@dhruvilshah3 (Contributor) commented May 17, 2018

After successful completion of KafkaProducer#close, it is possible that an application calls KafkaProducer#send. If the send is invoked for a topic for which we do not have any metadata, the producer will block until `max.block.ms` elapses; we do not expect to receive any metadata update in this case because the Sender (and NetworkClient) has already exited. It is only when RecordAccumulator#append is invoked that we notice the producer has already been closed and throw an exception. If `max.block.ms` is set to Long.MaxValue (or a sufficiently high value in general), the producer could block awaiting metadata indefinitely.

This patch makes sure `Metadata#awaitUpdate` periodically checks whether the network client has been closed, and if so bails out as soon as possible.
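The shape of the fix can be sketched roughly as follows. This is an illustrative sketch, not the actual patch: class, field, and method names are hypothetical, and the exception type was itself a point of discussion in the review.

```java
import java.util.concurrent.TimeoutException;

// Rough sketch of the idea (names illustrative, not the actual patch):
// awaitUpdate waits in bounded intervals and re-checks a closed flag
// between waits, so close() on another thread unblocks a caller that
// would otherwise wait for a metadata update that can never arrive.
public class MetadataSketch {
    private boolean isClosed = false;
    private int version = 1;

    public synchronized void close() {
        isClosed = true;
        notifyAll(); // wake any thread blocked in awaitUpdate
    }

    public synchronized int version() {
        return version;
    }

    public synchronized void update() {
        version++;
        notifyAll();
    }

    public synchronized void awaitUpdate(int lastVersion, long maxWaitMs)
            throws InterruptedException, TimeoutException {
        long deadline = System.currentTimeMillis() + maxWaitMs;
        while (version <= lastVersion) {
            if (isClosed)
                throw new IllegalStateException("Requested metadata update after close");
            long remainingMs = deadline - System.currentTimeMillis();
            if (remainingMs <= 0)
                throw new TimeoutException("Timed out waiting for metadata update");
            wait(Math.min(remainingMs, 100)); // bounded wait, then re-check isClosed
        }
    }
}
```

Without the bounded wait and the closed-flag check, a caller with a very large `max.block.ms` would sleep forever, since the closed NetworkClient will never deliver the update that triggers notifyAll().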

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@dhruvilshah3 dhruvilshah3 changed the title KAFKA-6897: Prevent producer from blocking indefinitely on close KAFKA-6897: Prevent producer from blocking indefinitely after close May 17, 2018

cmccabe commented May 17, 2018

The concept seems good. I'm not sure that IllegalStateException is the right one to throw, since the state is legal and reachable. Maybe something like TimeoutException?

Contributor:

Do we need to overload version for this purpose? Would a separate flag work? Maybe a more conventional approach would just use a close() method and have an isClosed field or something like that?

Contributor (Author):

I thought it might be a bit awkward to have a close() method for a class like Metadata that has no real underlying resources. That said, I also introduced close() for MetadataUpdater, so maybe we should do the same for Metadata as well. I will update the PR.


dhruvilshah3 commented May 17, 2018

@cmccabe I agree that IllegalStateException is probably not the most appropriate one to throw in this case. I was following what other KafkaProducer methods do when invoked after close; for example, RecordAccumulator#abortBatches and RecordAccumulator#append throw IllegalStateException as well. Maybe we should change all of these to throw a more suitable exception in a subsequent PR?

Contributor:

Probably not a big deal since awaitUpdate will not block in wait for longer than 100ms, but we may as well call notifyAll()?

Contributor:

Should Metadata extend Closeable?

Contributor:

I am wondering if we should throw KafkaException. This is an expected state since the producer is designed to block in send() to await metadata and there is not really any way for a user to avoid it. To be consistent, we could also raise KafkaException from RecordAccumulator in the similar scenario.

Member:

yeah, the existing IllegalStateException is confusing and we should fix it.

@dhruvilshah3 (Author):

Retest this please

@hachikuji (Contributor) left a comment:

Thanks, left a few comments. Also note the build failure. Might be a good idea to rebase and verify that the patch is still compiling.

Contributor:

Not sure it makes sense to change this one. In fact, close() should really be idempotent, so maybe we can just remove this check?

Contributor:

I think it's fine to leave this unchanged since it is only invoked at the start of the mock producer apis.

Contributor (Author):

Probably not a big deal, but I do see this being called from MockProducer#send for example so it might be worth keeping things consistent by throwing KafkaException as we do when KafkaProducer#send is called after close.

Contributor:

Not a big deal, but perhaps we could let MetadataUpdater implement Closeable? We can still override close() so that it doesn't throw an exception.

Contributor:

Don't we need to update this?

Contributor (Author):

Related to the other comment - Metadata#update throws IllegalStateException when invoked after close.

Contributor:

Since we have the notify in close(), do we still need this change?

Contributor (Author):

Good point, reverted.

Contributor:

Should we use KafkaException here as well?

@dhruvilshah3 (Author) commented May 18, 2018:

Hmm, I feel IllegalStateException is more appropriate in this case. We expect NetworkClient to not invoke Metadata#update after it has called Metadata#close().

Contributor:

This message seems a little low level for something which will get propagated back to the user. An alternative to consider would be to let awaitUpdate return a boolean indicating whether the update happened or not. That would allow us to raise an exception with a producer-specific message from send().

Contributor (Author):

I am now catching this exception in KafkaProducer#send and rethrowing with a more appropriate message.

Contributor:

Does this need to be public?

Contributor (Author):

I have KafkaProducer calling into this now, so needs to be public.

Contributor:

Hmm.. The old logic would let us continue fetching in the case of a timeout. Do you think that was not intentional?

Contributor (Author):

Even if we continued fetching, we would have failed the test at the end. tearDown() checks if we saw any background errors and fails the test if we did, so I thought that this change should be reasonable. Let me know if you think otherwise.

@dhruvilshah3 dhruvilshah3 force-pushed the producer-close branch 2 times, most recently from ed4be25 to efd3e9a Compare May 18, 2018 23:21
@hachikuji (Contributor):

Discussed offline, but we should try to distinguish legitimate illegal state errors from the case where a producer method is called after close() returns.

@dhruvilshah3 dhruvilshah3 force-pushed the producer-close branch 4 times, most recently from 19794a5 to 8e47fe3 Compare July 3, 2018 20:37
@hachikuji (Contributor) left a comment:

Sorry for the delay. Left a few more comments.

Contributor:

Can you explain why we need to catch this? It's generally a bad practice to ignore interrupts, so usually we either let the exception propagate or we reset the interrupt so that the caller has a chance to observe it.

Contributor (Author):

I am rethrowing as KafkaException if the interrupt was because of producer close; close() calls notifyAll() which could interrupt the wait() in this method. Does this seem reasonable?

Contributor:

The problem is that we are losing the indication that the interrupt has occurred. A caller up the stack may depend on seeing it. I think I would just let the exception be raised in all cases even if the producer is being closed.
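The convention the reviewer is describing is the standard one for Java interrupt handling. A minimal sketch (class and method names hypothetical): if the interrupt is translated into another exception rather than propagated, the thread's interrupt status must be restored first so callers further up the stack can still observe it.

```java
// Minimal sketch of the interrupt-handling convention discussed here
// (class and method names hypothetical): if wait() is interrupted,
// restore the thread's interrupt status before translating the
// exception, so callers up the stack can still observe the interrupt.
public class InterruptExample {
    public static void waitQuietly(Object lock, long timeoutMs) {
        synchronized (lock) {
            try {
                lock.wait(timeoutMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag for the caller
                throw new RuntimeException("Interrupted while waiting for metadata update", e);
            }
        }
    }
}
```

Swallowing the InterruptedException without the `interrupt()` call would silently erase the interrupt, which is exactly the problem raised above.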

Contributor:

Should Metadata extend Closeable?

Contributor:

nit: "after producer has been closed"?

Contributor:

Should we chain the caught exception? We expect this to be the close exception, but it could also be a timeout or an authentication failure. Might be useful to know in some scenarios.

Contributor:

Since it's trace level anyway, maybe we should print the stack trace instead of just the message.

Contributor:

I know it is from the original code, but asserting the error message seems dubious. Maybe we can just verify the exception type is KafkaException?


asfgit commented Jul 20, 2018

FAILURE
9289 tests run, 5 skipped, 0 failed.
--none--

@dhruvilshah3 (Author):

Retest this please

@hachikuji (Contributor) left a comment:

Thanks for the updates. Just one additional comment.

Contributor:

This feels a little brittle. If there is a delay before executing the task, then send() may raise the wrong exception. I think we could make it more reliable by waiting until the topic "test" has been added to Metadata.

Contributor (Author):

Good point. There is still some degree of uncertainty (even if much smaller than before) so I retained the sleep.

Contributor:

Can you elaborate? I don't see any point in the code where we would return between adding the topic and awaiting the update.

Contributor:

Can we use waitForCondition?
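A waitForCondition-style helper has roughly this shape. This is a sketch, not the actual `org.apache.kafka.test.TestUtils` signature: poll the condition until it holds or a deadline passes, rather than sleeping a fixed amount and hoping the background work has finished.

```java
import java.util.function.BooleanSupplier;

// Sketch of a waitForCondition-style test helper (not the actual
// TestUtils signature): poll until the condition holds or the
// deadline passes, instead of a fixed-duration sleep.
public class WaitUtil {
    public static void waitForCondition(BooleanSupplier condition, long maxWaitMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + maxWaitMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline)
                throw new AssertionError("Condition not met within " + maxWaitMs + " ms");
            Thread.sleep(10); // short poll interval bounds how stale the check can be
        }
    }
}
```

Polling like this keeps the test fast in the common case (it returns as soon as the condition holds) while still bounding the worst case, which is why it is less brittle than a fixed sleep.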

@hachikuji (Contributor) left a comment:

LGTM. Thanks for the patch.


ijuma commented Jul 21, 2018

@hachikuji is this ready to be merged? And is it just trunk or 2.0 as well?

@hachikuji (Contributor):

@ijuma Yes, I will merge to trunk and 2.0.

@hachikuji hachikuji merged commit d11f6f2 into apache:trunk Jul 21, 2018
hachikuji pushed a commit that referenced this pull request Jul 21, 2018
… closed (#5027)
ijuma added a commit to confluentinc/kafka that referenced this pull request Jul 23, 2018
* apache-github/2.0:
  MINOR: Close ZooKeeperClient if waitUntilConnected fails during construction (apache#5411)
  KAFKA-6897; Prevent KafkaProducer.send from blocking when producer is closed (apache#5027)
@dhruvilshah3 dhruvilshah3 deleted the producer-close branch July 23, 2018 07:02