KAFKA-9274: Remove retries from InternalTopicManager#9060

Merged
mjsax merged 5 commits into apache:trunk from mjsax:kafka-9274-kip-572-internal
Aug 6, 2020
Conversation

Member

@mjsax mjsax commented Jul 23, 2020

  • part of KIP-572
  • replace retries in InternalTopicManager with infinite retries plus
    a new timeout, based on consumer config MAX_POLL_INTERVAL_MS
  • if the new timeout hits, we don't throw StreamsException any longer (as we did when exceeding retries), but send INCOMPLETE_SOURCE_TOPIC_METADATA error code to let all instances shut down

Third PR for KIP-572 (cf #8864 and #9047)

Call for review @vvcephei
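The change described above boils down to a retry-until-deadline loop in place of a fixed retry count. A minimal, self-contained sketch of that pattern (all names here are hypothetical; the real logic lives in InternalTopicManager):

```java
// Sketch of retry-until-deadline, replacing a fixed `retries` count.
// Names are illustrative, not the actual Kafka Streams code.
public class RetryUntilDeadline {
    interface Attempt { boolean tryOnce(); }

    // Retries indefinitely until success or until the (simulated) clock
    // passes deadlineMs. clock[0] is a fake clock so the sketch runs
    // without sleeping.
    static boolean run(Attempt attempt, long deadlineMs, long[] clock) {
        while (clock[0] < deadlineMs) {
            if (attempt.tryOnce()) {
                return true; // topics created/validated
            }
            clock[0] += 100L; // stand-in for a retry.backoff.ms pause
        }
        // deadline exceeded -> caller signals INCOMPLETE_SOURCE_TOPIC_METADATA
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        long[] clock = {0L};
        // succeeds on the third attempt, well before the 1s deadline
        boolean ok = run(() -> ++calls[0] >= 3, 1_000L, clock);
        System.out.println(ok + " after " + calls[0] + " attempts");
    }
}
```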

Member Author

Minor side fix: fetchEndOffset can never throw TimeoutException because it already catches all RuntimeExceptions and converts them into StreamsException

Member Author

First TODO removed

Member Author

Second TODO removed

Member Author

Third TODO removed -- as we don't pass retries and retry.backoff.ms via admin config any longer, this test is not needed any more

Member Author

Fourth TODO removed

Member Author

@vvcephei this is an open question

\cc @guozhangwang

Contributor

I'm thinking that moving forward we should try not to create internal topics during rebalance but pre-create them at startup; but for now, assuming this is still the case, I think letting the whole application die is fine --- i.e., treat it the same as missing source topics. Hence I'm leaning towards encoding INCOMPLETE_SOURCE_TOPIC_METADATA to shut down the whole app, across all clients.

Member Author

Ack

@mjsax mjsax changed the title KAFKA-9274: Remove retries from InternalTopicManager Jul 23, 2020
Member Author

mjsax commented Jul 23, 2020

Member

@dajac dajac Jul 23, 2020

If we want to guarantee that the deadlineMs is respected, I think that we must set the timeout of the AdminClient's call accordingly: CreateTopicsOptions.timeoutMs. With the default, I think that the call could be longer than half of MAX_POLL_INTERVAL_MS_CONFIG.

Member Author

Good question. Default max.poll.interval.ms is 5 minutes (i.e., the deadline is set to 2.5 minutes by default) while default api.default.timeout.ms is 1 minute. Thus we might be ok?

Member

That's right. I misread the default value of max.poll.interval.ms, too many zeros for my eyes ;). The default works fine then. Do we want to protect ourselves if the user changes the default? Or shall we just call out that api.default.timeout.ms should be lower than max.poll.interval.ms somewhere?

Member Author

I am happy to add a check in StreamsConfig and either throw or log a WARN depending how strict we want to be.
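A sketch of what such a check could look like (hypothetical names; not the actual StreamsConfig code): a single admin call bounded by api.default.timeout.ms should fit within the rebalance deadline of max.poll.interval.ms / 2, and we warn otherwise.

```java
// Hypothetical sketch of the StreamsConfig sanity check discussed above.
// Assumption: the deadline is max.poll.interval.ms / 2, so any single
// admin call bounded by api.default.timeout.ms should fit within it.
public class AdminTimeoutCheck {
    static boolean fitsDeadline(int apiDefaultTimeoutMs, int maxPollIntervalMs) {
        return apiDefaultTimeoutMs <= maxPollIntervalMs / 2;
    }

    public static void main(String[] args) {
        // Defaults: api.default.timeout.ms = 60s, max.poll.interval.ms = 5 min.
        int apiDefaultTimeoutMs = 60_000;
        int maxPollIntervalMs = 300_000;
        if (!fitsDeadline(apiDefaultTimeoutMs, maxPollIntervalMs)) {
            System.out.println("WARN: api.default.timeout.ms exceeds max.poll.interval.ms / 2");
        } else {
            System.out.println("ok");
        }
    }
}
```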

Member

Thinking a bit more about this, with the defaults, you may end up not honouring the deadline: createTopics can take up to 1m, so if you invoke one when less than 1m is remaining before the deadline, you may not honour it. It may not be that important though.

If we want to strictly enforce it, we could calculate the maximum timeout for each call, something like deadline - now, and set it with CreateTopicsOptions.timeoutMs.
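The per-call cap suggested here is essentially one line of arithmetic; a rough sketch (names illustrative, not the actual code):

```java
// Sketch of the per-call timeout proposed above: cap each AdminClient call
// at the time remaining before the deadline, which the real code would
// pass along roughly as
//   adminClient.createTopics(topics, new CreateTopicsOptions().timeoutMs(cap));
public class RemainingTimeout {
    // Clamp to zero so an already-passed deadline yields an immediate timeout.
    static int remainingTimeoutMs(long deadlineMs, long nowMs) {
        return (int) Math.max(0L, deadlineMs - nowMs);
    }

    public static void main(String[] args) {
        // 2.5-minute deadline, 2 minutes already elapsed -> 30s left for the call
        System.out.println(remainingTimeoutMs(150_000L, 120_000L));
    }
}
```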

Member Author

mjsax commented Jul 23, 2020

Only the StreamsStandbyTask.test_standby_tasks_rebalance system test failed, and it's known to be buggy.

Member Author

mjsax commented Jul 23, 2020

Jenkins failed on known flaky tests only.

@abbccdda abbccdda left a comment

Overall LGTM


Why do we reduce max poll interval?

Member Author

@mjsax mjsax Aug 4, 2020

In this PR, we change the "deadline" in the group leader to create/verify all internal topics from "counting retries" to a timeout of max.poll.interval.ms / 2, and we reduce the default of 5 minutes to speed up this test.

cf https://github.com/apache/kafka/pull/9060/files#diff-d3963e433c59b08688bb4481faa20e97R79


abbccdda commented Aug 4, 2020

Also, do we have new unit test coverage for the changes?

Member Author

mjsax commented Aug 4, 2020

Updated. Call for review @abbccdda

Also, do we have new unit test coverage for the changes?

We did not really change much, only switching from retries to a "timeout"; thus, existing unit tests (which I updated accordingly, e.g., different configs and/or different exception types) should be sufficient?

The only other thing is the new error code and I just added 2 unit tests for this case.


Is this part of the initiative to throw a different exception? Could we update the summary of this PR?


Could we update the meta comment to explain when a TaskAssignmentException is thrown?

Member Author

@mjsax mjsax Aug 4, 2020

TaskAssignmentException is not a checked exception and it's just a courtesy declaration... (I can also remove throws TaskAssignmentException if you prefer.)

Don't see any need to document anything further -- the code makes it clear (in fact, this method does not even throw the exception itself; it just bubbles up from internalTopicManager.makeReady and the code documents itself):
https://github.com/apache/kafka/pull/9060/files#diff-d3963e433c59b08688bb4481faa20e97R179-R184


Comments for thrown exception


s/temporary/temporarily

Member Author

mjsax commented Aug 4, 2020

@abbccdda I updated the PR. I also realized that it might be better to throw a TimeoutException instead of a TaskAssignmentException if we hit the new timeout. Updated the code and tests accordingly.

@abbccdda abbccdda left a comment

Just one minor comment.


Since we could throw different exceptions here, it would be good to add a log to indicate which type of exception is thrown.

Member Author

There are already corresponding log.error statements before those exceptions are thrown. No need to double-log, IMHO?


Cool, sg.

mjsax added 5 commits August 5, 2020 14:16
 - part of KIP-572
 - replace `retries` in InternalTopicManager with infinite retries plus
   a new timeout, based on consumer config MAX_POLL_INTERVAL_MS
@mjsax mjsax force-pushed the kafka-9274-kip-572-internal branch from bf49314 to 05b0faa August 5, 2020 21:30
@abbccdda abbccdda left a comment

LGTM

@mjsax mjsax merged commit 9903013 into apache:trunk Aug 6, 2020
@mjsax mjsax deleted the kafka-9274-kip-572-internal branch August 7, 2020 03:48
// need to add mandatory configs; otherwise `QuietConsumerConfig` throws
consumerConfig.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
consumerConfig.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
retryTimeoutMs = new QuietConsumerConfig(consumerConfig).getInt(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG) / 2L;
Member

Hey @mjsax, am I reading this PR correctly? Do we now only allow a single member to retry topic creation/validation for up to half of the poll interval, after which we shut down the entire application? That sounds like the opposite of resiliency... what if the brokers are temporarily unavailable? Before this we would just let the single thread die, and the internal topic creation/validation would be retried on the subsequent rebalance. That wasn't ideal, but given the upcoming work to allow reviving/recreating a dead thread, that seems preferable to permanently ending the application?

Sorry if I'm misreading this, was just going over all the PRs in the last month or so to produce a diff+summary of the important ones, and want to make sure I actually understand all the changes we've made

Member

Apologies if this was touched on in the KIP, it's been a while and the discussion thread was quite long so I may have missed something there

Member Author

Note that the previous default was "zero retries", and thus the new default is more resilient with a 5-minute default max.poll.interval.ms. -- But yes, we shut down the whole app for this case now, as proposed by @guozhangwang (IIRC).

Contributor

Today, since we do not have a way to partially create tasks, we'd have to create all topics to make sure all tasks are "complete" within each rebalance. If we cannot successfully create the topics within the poll.interval (i.e., we'd need to complete that rebalance within the poll.interval, and I guess halving it is to be more conservative), then killing that thread is not very useful anyway, since we cannot proceed with the initializable tasks.

That being said, with the upcoming work I'd agree that just shutting down the thread and allowing users to optionally retry the rebalance with new threads would be preferable.

Contributor

cc @cadonna @wcarlson5 to bring to your attention.

Member

Yeah, I think we should remove the shutdown error code in case of a TimeoutException during internal topic validation before 2.7. I'll create a ticket so we don't lose track -- I think even just letting it kill the one thread is better than killing all of them.

Member Author

SGTM.


Labels

kip (Requires or implements a KIP), streams
