
KAFKA-9274: Gracefully handle timeout exception #8060

Merged
guozhangwang merged 2 commits into apache:trunk from guozhangwang:K9274-timeout-exception
Feb 15, 2020

Conversation

Contributor

@guozhangwang guozhangwang commented Feb 7, 2020

  1. Delay the initialization (producer.initTxn) from construction to maybeInitialize; if it times out we just swallow and retry in the next iteration.

  2. If completeRestoration (consumer.committed) times out, just swallow and retry in the next iteration.

  3. For other calls (producer.partitionsFor, producer.commitTxn, consumer.commit), treat the timeout exception as fatal.
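To make the three cases concrete, here is a minimal sketch of the swallow-and-retry vs. fatal pattern described above; all names (`TimeoutPolicySketch`, `TxnProducer`, `maybeInitialize`) are hypothetical stand-ins, not the actual Streams internals.

```java
// Hypothetical sketch of the retry-vs-fatal pattern; names are illustrative
// stand-ins, not the actual Kafka Streams internals.
class TimeoutPolicySketch {
    static class TimeoutException extends RuntimeException { }
    static class StreamsException extends RuntimeException {
        StreamsException(final String msg, final Throwable cause) { super(msg, cause); }
    }

    interface TxnProducer { void initTransactions(); }

    private boolean transactionInitialized = false;

    // Cases 1 and 2: swallow the timeout and retry on the next run-loop iteration.
    void maybeInitialize(final TxnProducer producer) {
        if (transactionInitialized) {
            return;
        }
        try {
            producer.initTransactions();
            transactionInitialized = true;
        } catch (final TimeoutException swallowed) {
            // not fatal: we will retry on the next iteration
        }
    }

    // Case 3: for commit-like calls, a timeout is treated as fatal.
    void commit(final Runnable commitCall) {
        try {
            commitCall.run();
        } catch (final TimeoutException fatal) {
            throw new StreamsException("Timed out while committing", fatal);
        }
    }

    boolean initialized() { return transactionInitialized; }
}
```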

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Member

@mjsax mjsax left a comment

Some initial comments and questions

Member

If we change the semantics, we should update the class JavaDocs accordingly -- also, do we still need to state that "It means all tasks belonging to this thread have been migrated"?

Member

Why do we need to clear the store now but not before?

Contributor Author

Because before, we never "revived" a task: once it was closed, it was dead. Now revival is possible, so we may re-initialize the stores, and if we did not clear them first it would cause an illegal-state exception.
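A tiny sketch of the failure mode being avoided; `ReviveSketch`, `registerStore`, and `revive` are hypothetical names, not the actual ProcessorStateManager API.

```java
// Hypothetical sketch (names illustrative): once a closed task can be revived,
// its registered stores must be cleared before re-initialization, otherwise
// registering the same store a second time trips an illegal-state check.
import java.util.HashMap;
import java.util.Map;

class ReviveSketch {
    private final Map<String, Object> registeredStores = new HashMap<>();

    void registerStore(final String name, final Object store) {
        if (registeredStores.containsKey(name)) {
            throw new IllegalStateException("Store " + name + " has already been registered");
        }
        registeredStores.put(name, store);
    }

    // Reviving a closed task: clear first so store re-initialization succeeds.
    void revive() {
        registeredStores.clear();
    }
}
```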

Member

Well -- we could also "buffer" the record and try to send it later? In the meantime we would need to pause the corresponding task though, to not process more input records (of course, we would need to let the task finish processing the current input record, which might lead to more output records that we would need to buffer, too). -- This is just a wild thought and we could also handle this case later if required.

Contributor Author

I thought about a buffering mechanism here, but decided it may not be worth it since we've not seen timeouts from partitionsFor -- they should be quite rare because in most cases the producer already has the partition metadata cached locally. If timeouts from this call become an issue we can revisit the buffering, wdyt?

Member

Does the consumer not log this already within assign(...) ?

Contributor Author

Yes, but it does not log the "delta" :) Joking aside, I found that logging the added / removed partitions is important and makes debugging easier.

Member

Same question as above (this question applies to more log statements below -- I won't add a comment each time).

Contributor Author

Hmm for pause / resume I think I buy your argument -- I can remove these two from the changelog reader.

Contributor Author

After some more thought I feel it's better to still keep it, since in many cases (e.g. unit test troubleshooting) we usually only enable debug logging on sub-packages like o.a.k.streams instead of everything, so we cannot always rely on the embedded clients' log4j entries.

Member

Why do we throw TimeoutException directly but not wrap it?

Contributor Author

Because the caller (TaskManager) would swallow the TimeoutException anyway.
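The throw-unwrapped / swallow-at-caller relationship can be sketched as follows; all names here (`SwallowAtCallerSketch`, `Task`, `tryCompleteRestoration`) are hypothetical, not the actual TaskManager code.

```java
// Illustrative sketch (all names hypothetical): the callee lets TimeoutException
// propagate unwrapped, because the caller swallows it and simply retries the
// task on a later iteration; wrapping would only force an unwrap at the caller.
import java.util.Iterator;
import java.util.List;

class SwallowAtCallerSketch {
    static class TimeoutException extends RuntimeException { }

    interface Task {
        void completeRestoration();  // may throw TimeoutException, unwrapped
    }

    // Caller-side loop: a timed-out task just stays in the pending list.
    static int tryCompleteRestoration(final List<Task> pending) {
        int completed = 0;
        final Iterator<Task> it = pending.iterator();
        while (it.hasNext()) {
            try {
                it.next().completeRestoration();
                it.remove();
                completed++;
            } catch (final TimeoutException swallowed) {
                // leave the task pending; retried on the next iteration
            }
        }
        return completed;
    }
}
```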

Member

To what extent do we skip? Seems the method just executes as always?

Contributor Author

We skip adding records to it. After reviewing your PR I think I would refactor this logic a bit more, stay tuned.

Contributor Author

I moved this logic from addRecords to isProcessible so we would still add records for closing tasks but would skip processing them.
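A minimal sketch of that refactor's shape; `ProcessGateSketch` and its methods are illustrative stand-ins, not the actual StreamTask API.

```java
// Hypothetical sketch of the refactor described above: records are still
// buffered even for a closing task, but isProcessible() gates processing.
import java.util.ArrayDeque;
import java.util.Deque;

class ProcessGateSketch {
    enum State { RUNNING, CLOSING }

    private final Deque<String> buffer = new ArrayDeque<>();
    private State state = State.RUNNING;
    private int processed = 0;

    void addRecords(final String record) {
        buffer.add(record);               // always buffer, even when closing
    }

    void startClosing() {
        state = State.CLOSING;
    }

    boolean isProcessible() {
        return state == State.RUNNING && !buffer.isEmpty();
    }

    void maybeProcess() {
        if (isProcessible()) {            // skip processing for closing tasks
            buffer.poll();
            processed++;
        }
    }

    int processedCount() { return processed; }
    int bufferedCount() { return buffer.size(); }
}
```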

Member

Do we need to distinguish StreamsException and KafkaException (StreamsException is a KafkaException and both are fatal)?

Actually, a similar question about KafkaException and Exception -- the different error messages don't seem to provide much value?

Contributor Author

Yeah, that's a good question.. My original plan was that we should not expect KafkaException to be thrown here, since we already wrap all of them as StreamsException, so if one is thrown it may be a bug. But since we re-throw in either case anyway, I'm not feeling so strongly now (I've also been thinking about not re-throwing StreamsException since it is expected, but that would be a behavior change since the user's registered handler would not be triggered then). I can just collapse all of them into one catch and log the actual exception as well, wdyt?

Member

Should we keep this check and add final else that throws an IllegalStateException instead?

Contributor Author

Yeah that sounds better :)
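The agreed-on shape could look like this sketch; `StateCheckSketch` and `describe` are hypothetical names, not the code under review.

```java
// Sketch of the suggestion (hypothetical names): keep the explicit checks and
// make the "impossible" branch fail fast with an IllegalStateException.
class StateCheckSketch {
    enum TaskState { CREATED, RUNNING, CLOSED }

    static String describe(final TaskState state) {
        if (state == TaskState.CREATED) {
            return "not yet initialized";
        } else if (state == TaskState.RUNNING) {
            return "processing";
        } else if (state == TaskState.CLOSED) {
            return "done";
        } else {
            // a state added without updating this method now fails loudly
            throw new IllegalStateException("Unknown task state: " + state);
        }
    }
}
```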

Member

Why do we only remove the task if it was active now?

Contributor Author

We previously also only removed the task (#8058 (comment)); the original motivation to only remove the task is that standby tasks are likely to come back, and even if they do not they can still be closed in the next rebalance's onAssignment call, so we just keep them a bit longer hoping the "wait" is worth it :)

@guozhangwang guozhangwang force-pushed the K9274-timeout-exception branch from 5adea85 to 74cfbf1 on February 14, 2020 21:30
@guozhangwang guozhangwang changed the title KAFKA-9274: Gracefully handle timeout exception [WIP] KAFKA-9274: Gracefully handle timeout exception Feb 14, 2020
client.prepareResponse(initProducerIdResponse(1L, (short) 5, Errors.NONE));

// retry initialization should work
producer.initTransactions();
Contributor Author

@guozhangwang guozhangwang Feb 14, 2020

This is just to verify that initTransactions can indeed be retried. cc @hachikuji

Contributor Author

@guozhangwang guozhangwang left a comment

Extracted the timeout handling logic from #8058.

@mjsax mjsax added the streams label Feb 15, 2020
@mjsax
Member

mjsax commented Feb 15, 2020

LGTM.

There are still a few cases in the GlobalStateManager for which we don't handle TimeoutException -- also, in GlobalStateManager we have a limited number of retries -- wondering if we should make sure we handle timeouts consistently? But this would be a follow up PR.

@guozhangwang
Contributor Author

@mjsax thanks for pointing it out, I plan to do another cleanup to merge GlobalStateManager into ProcessorStateManager (there are still some TODOs marked here e.g. https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/StoreChangelogReader.java#L615) and then these should be resolved.

@guozhangwang guozhangwang merged commit d8756e8 into apache:trunk Feb 15, 2020