
KAFKA-9274: Gracefully handle timeout exception #8060

Merged
guozhangwang merged 2 commits into apache:trunk from guozhangwang:K9274-timeout-exception
Feb 15, 2020

Conversation

Contributor

@guozhangwang guozhangwang commented Feb 7, 2020

  1. Delay the initialization (producer.initTxn) from construction to maybeInitialize; if it times out we just swallow and retry in the next iteration.

  2. If completeRestoration (consumer.committed) times out, just swallow and retry in the next iteration.

  3. For other calls (producer.partitionsFor, producer.commitTxn, consumer.commit), treat the timeout exception as fatal.
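To make the three cases concrete, here is a minimal sketch of the swallow-and-retry vs. fatal pattern described above; all names (`TimeoutPolicySketch`, `TxnProducer`, `maybeInitialize`) are hypothetical stand-ins, not the actual Streams internals.

```java
// Hypothetical sketch of the retry-vs-fatal pattern; names are illustrative
// stand-ins, not the actual Kafka Streams internals.
class TimeoutPolicySketch {
    static class TimeoutException extends RuntimeException { }
    static class StreamsException extends RuntimeException {
        StreamsException(final String msg, final Throwable cause) { super(msg, cause); }
    }

    interface TxnProducer { void initTransactions(); }

    private boolean transactionInitialized = false;

    // Cases 1 and 2: swallow the timeout and retry on the next run-loop iteration.
    void maybeInitialize(final TxnProducer producer) {
        if (transactionInitialized) {
            return;
        }
        try {
            producer.initTransactions();
            transactionInitialized = true;
        } catch (final TimeoutException swallowed) {
            // not fatal: we will retry on the next iteration
        }
    }

    // Case 3: for commit-like calls, a timeout is treated as fatal.
    void commit(final Runnable commitCall) {
        try {
            commitCall.run();
        } catch (final TimeoutException fatal) {
            throw new StreamsException("Timed out while committing", fatal);
        }
    }

    boolean initialized() { return transactionInitialized; }
}
```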

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Member

@mjsax mjsax left a comment

Some initial comments and questions

Member

If we change the semantics, we should update the class JavaDocs accordingly -- also, do we still need to state that "It means all tasks belonging to this thread have been migrated"?

Member

Why do we need to clear the store now but not before?

Contributor Author

Because before, we never "revived" a task: once it was closed, it was dead. Now revival is possible, so we may re-initialize the stores, and if we did not clear them first it would cause an illegal-state exception.
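A tiny sketch of the failure mode being avoided; `ReviveSketch`, `registerStore`, and `revive` are hypothetical names, not the actual ProcessorStateManager API.

```java
// Hypothetical sketch (names illustrative): once a closed task can be revived,
// its registered stores must be cleared before re-initialization, otherwise
// registering the same store a second time trips an illegal-state check.
import java.util.HashMap;
import java.util.Map;

class ReviveSketch {
    private final Map<String, Object> registeredStores = new HashMap<>();

    void registerStore(final String name, final Object store) {
        if (registeredStores.containsKey(name)) {
            throw new IllegalStateException("Store " + name + " has already been registered");
        }
        registeredStores.put(name, store);
    }

    // Reviving a closed task: clear first so store re-initialization succeeds.
    void revive() {
        registeredStores.clear();
    }
}
```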

Member

Well -- we could also "buffer" the record and try to send it later? In the meantime we would need to pause the corresponding task though, to not process more input records (of course, we would need to let the task finish processing the current input record, which might lead to more output records that we would need to buffer, too). -- This is just a wild thought and we could also handle this case later if required.

Contributor Author

I thought about a buffering mechanism here, but decided it may not be worth it since we've not seen timeouts from partitionsFor -- they should be quite rare because in most cases the producer already has the partition metadata cached locally. If timeouts from this call become an issue we can revisit the buffering, wdyt?

Member

Does the consumer not log this already within assign(...) ?

Contributor Author

Yes, but it does not log the "delta" :) Joking aside, I found that logging the added / removed partitions is important and makes debugging easier.

Member

Same question as above (this question applies to more log statements below -- I won't add a comment each time).

Contributor Author

Hmm for pause / resume I think I buy your argument -- I can remove these two from the changelog reader.

Contributor Author

After some more thought I feel it's better to still keep it, since in many cases (e.g. unit test troubleshooting) we usually only enable debug logging on sub-packages like o.a.k.streams instead of everything, so we cannot always rely on the embedded clients' log4j entries.

Member

Why do we throw TimeoutException directly but not wrap it?

Contributor Author

Because the caller (TaskManager) would swallow the TimeoutException anyway.
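The throw-unwrapped / swallow-at-caller relationship can be sketched as follows; all names here (`SwallowAtCallerSketch`, `Task`, `tryCompleteRestoration`) are hypothetical, not the actual TaskManager code.

```java
// Illustrative sketch (all names hypothetical): the callee lets TimeoutException
// propagate unwrapped, because the caller swallows it and simply retries the
// task on a later iteration; wrapping would only force an unwrap at the caller.
import java.util.Iterator;
import java.util.List;

class SwallowAtCallerSketch {
    static class TimeoutException extends RuntimeException { }

    interface Task {
        void completeRestoration();  // may throw TimeoutException, unwrapped
    }

    // Caller-side loop: a timed-out task just stays in the pending list.
    static int tryCompleteRestoration(final List<Task> pending) {
        int completed = 0;
        final Iterator<Task> it = pending.iterator();
        while (it.hasNext()) {
            try {
                it.next().completeRestoration();
                it.remove();
                completed++;
            } catch (final TimeoutException swallowed) {
                // leave the task pending; retried on the next iteration
            }
        }
        return completed;
    }
}
```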

Member

To what extent do we skip? Seems the method just executes as always?

Contributor Author

We skip adding records to it. After reviewing your PR I think I would refactor this logic a bit more, stay tuned.

Contributor Author

I moved this logic from addRecords to isProcessible so we would still add records for closing tasks but would skip processing them.
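A minimal sketch of that refactor's shape; `ProcessGateSketch` and its methods are illustrative stand-ins, not the actual StreamTask API.

```java
// Hypothetical sketch of the refactor described above: records are still
// buffered even for a closing task, but isProcessible() gates processing.
import java.util.ArrayDeque;
import java.util.Deque;

class ProcessGateSketch {
    enum State { RUNNING, CLOSING }

    private final Deque<String> buffer = new ArrayDeque<>();
    private State state = State.RUNNING;
    private int processed = 0;

    void addRecords(final String record) {
        buffer.add(record);               // always buffer, even when closing
    }

    void startClosing() {
        state = State.CLOSING;
    }

    boolean isProcessible() {
        return state == State.RUNNING && !buffer.isEmpty();
    }

    void maybeProcess() {
        if (isProcessible()) {            // skip processing for closing tasks
            buffer.poll();
            processed++;
        }
    }

    int processedCount() { return processed; }
    int bufferedCount() { return buffer.size(); }
}
```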

Member

Do we need to distinguish StreamsException and KafkaException (StreamsException is a KafkaException and both are fatal)?

Actually, a similar question about KafkaException and Exception -- the different error messages don't seem to provide much value?

Contributor Author

Yeah, that's a good question.. My original plan was that we should not expect KafkaException to be thrown here, since we already wrap all of them as StreamsException, so if one is thrown it may be a bug. But since we re-throw in either case anyway, I'm not feeling so strongly now (I've also been thinking about not re-throwing StreamsException since it is expected, but that would be a behavior change since the user's registered handler would not be triggered then). I can just collapse all of them into one catch and log the actual exception as well, wdyt?

Member

Should we keep this check and add final else that throws an IllegalStateException instead?

Contributor Author

Yeah that sounds better :)
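The agreed-on shape could look like this sketch; `StateCheckSketch` and `describe` are hypothetical names, not the code under review.

```java
// Sketch of the suggestion (hypothetical names): keep the explicit checks and
// make the "impossible" branch fail fast with an IllegalStateException.
class StateCheckSketch {
    enum TaskState { CREATED, RUNNING, CLOSED }

    static String describe(final TaskState state) {
        if (state == TaskState.CREATED) {
            return "not yet initialized";
        } else if (state == TaskState.RUNNING) {
            return "processing";
        } else if (state == TaskState.CLOSED) {
            return "done";
        } else {
            // a state added without updating this method now fails loudly
            throw new IllegalStateException("Unknown task state: " + state);
        }
    }
}
```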

Member

Why do we only remove the task if it was active now?

Contributor Author

We previously also only removed the task (#8058 (comment)); the original motivation to only remove the task is that standby tasks are likely to come back, and even if they do not they can still be closed in the next rebalance's onAssignment call, so we just keep them a bit longer hoping the "wait" is worth it :)

@guozhangwang guozhangwang force-pushed the K9274-timeout-exception branch from 5adea85 to 74cfbf1 on February 14, 2020 21:30
@guozhangwang guozhangwang changed the title KAFKA-9274: Gracefully handle timeout exception [WIP] KAFKA-9274: Gracefully handle timeout exception Feb 14, 2020
client.prepareResponse(initProducerIdResponse(1L, (short) 5, Errors.NONE));

// retry initialization should work
producer.initTransactions();
Contributor Author

@guozhangwang guozhangwang Feb 14, 2020

This is just to verify that initTransactions can indeed be retried. cc @hachikuji

Contributor Author

@guozhangwang guozhangwang left a comment

Extracted the timeout handling logic from #8058.

@mjsax mjsax added the streams label Feb 15, 2020
@mjsax
Member

mjsax commented Feb 15, 2020

LGTM.

There are still a few cases in the GlobalStateManager for which we don't handle TimeoutException -- also, in GlobalStateManager we have a limited number of retries -- wondering if we should make sure we handle timeouts consistently? But this would be a follow up PR.

@guozhangwang
Contributor Author

@mjsax thanks for pointing it out, I plan to do another cleanup to merge GlobalStateManager into ProcessorStateManager (there are still some TODOs marked here e.g. https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/StoreChangelogReader.java#L615) and then these should be resolved.

@guozhangwang guozhangwang merged commit d8756e8 into apache:trunk Feb 15, 2020