KAFKA-9274: Remove `retries` from InternalTopicManager #9060
mjsax merged 5 commits into apache:trunk

Conversation
Minor side fix: fetchEndOffset can never throw TimeoutException because it already catches all RuntimeExceptions and converts them into StreamsException.
Third TODO removed -- as we don't pass `retries` and `retry.backoff.ms` via admin config any longer, this test is not needed any more.
@vvcephei this is an open question
cc @guozhangwang
I'm thinking that moving forward we should try to not create internal topics during a rebalance, but instead pre-create them at startup; for now, assuming this is still the case, I think letting the whole application die is fine --- i.e., treat it the same as missing source topics. Hence I'm leaning towards encoding INCOMPLETE_SOURCE_TOPIC_METADATA to shut down the whole app, across all clients.
Triggered system tests: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/4071/
If we want to guarantee that the deadlineMs is respected, I think that we must set the timeout of the AdminClient's call accordingly: CreateTopicsOptions.timeoutMs. With the default, I think that the call could be longer than half of MAX_POLL_INTERVAL_MS_CONFIG.
Good question. Default max.poll.interval.ms is 5 minutes (i.e., the deadline is set to 2.5 minutes by default) while default api.default.timeout.ms is 1 minute. Thus we might be ok?
That's right. I misread the default value of max.poll.interval.ms, too many zeros for my eyes ;). The default works fine then. Do we want to protect ourselves if the user changes the default? Or shall we just call out that api.default.timeout.ms should be lower than max.poll.interval.ms somewhere?
I am happy to add a check in StreamsConfig and either throw or log a WARN depending how strict we want to be.
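Such a check could be as simple as comparing the two configured values. A minimal sketch (hypothetical class and method names; this is not the actual StreamsConfig code, which would read both values from the resolved configs):

```java
// Sketch of the proposed sanity check: warn (or throw) if
// api.default.timeout.ms is not strictly lower than max.poll.interval.ms.
public class ConfigCheck {

    // Returns "OK" or a WARN message; a stricter variant could throw instead.
    static String validate(int apiDefaultTimeoutMs, int maxPollIntervalMs) {
        if (apiDefaultTimeoutMs >= maxPollIntervalMs) {
            return "WARN: api.default.timeout.ms (" + apiDefaultTimeoutMs
                + ") should be lower than max.poll.interval.ms (" + maxPollIntervalMs + ")";
        }
        return "OK";
    }

    public static void main(String[] args) {
        // Defaults: 1 minute vs 5 minutes -> fine
        System.out.println(validate(60_000, 300_000));
        // User raised the admin timeout to the poll interval -> warn
        System.out.println(validate(300_000, 300_000));
    }
}
```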
Thinking a bit more about this, with the defaults you may still end up not honouring the deadline: createTopics can take up to 1 minute, so if you invoke one when less than 1 minute is remaining before the deadline, you may not honour the deadline. It may not be that important though.
If we want to strictly enforce it, we could calculate the maximum timeout for each call, something like deadline - now, and set it with CreateTopicsOptions.timeoutMs.
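The "deadline - now" budget calculation could be sketched as follows (hypothetical names; the real code would pass the result to `CreateTopicsOptions.timeoutMs(...)`):

```java
import java.util.concurrent.TimeUnit;

// Sketch: cap each AdminClient call at the time remaining before the
// overall deadline, instead of relying on api.default.timeout.ms.
public class DeadlineBudget {

    // Timeout (ms) to set on the next createTopics call.
    static int remainingTimeoutMs(long deadlineMs, long nowMs) {
        long remaining = deadlineMs - nowMs;
        if (remaining <= 0) {
            throw new IllegalStateException("deadline already exceeded");
        }
        return (int) Math.min(remaining, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        long now = 0L;
        // 2.5 minutes, i.e. the default max.poll.interval.ms / 2
        long deadline = TimeUnit.SECONDS.toMillis(150);
        System.out.println(remainingTimeoutMs(deadline, now)); // prints 150000
    }
}
```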
Only the |
Jenkins failed on known flaky tests only.
In this PR, we change the "deadline" in the group leader to create/verify all internal topics from "counting retries" to a timeout of max.poll.interval.ms / 2 and we reduce the default of 5 minutes to speed up this test.
cf https://github.com/apache/kafka/pull/9060/files#diff-d3963e433c59b08688bb4481faa20e97R79
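The switch from "counting retries" to a wall-clock deadline can be sketched as follows (hypothetical names; this is not the actual InternalTopicManager code, which uses the admin client and its own backoff handling):

```java
// Sketch: retry indefinitely until the attempt succeeds or a deadline
// (e.g. derived from max.poll.interval.ms / 2) passes -- no retry counter.
public class BoundedRetry {

    interface Attempt {
        boolean tryOnce();
    }

    // Returns true on success, false once the deadline is exceeded;
    // the caller would map `false` to a TimeoutException / error code.
    static boolean retryUntilDeadline(Attempt attempt, long deadlineMs, long backoffMs) {
        while (System.currentTimeMillis() < deadlineMs) {
            if (attempt.tryOnce()) {
                return true;
            }
            try {
                Thread.sleep(backoffMs); // fixed backoff between attempts
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        final int[] attempts = {0};
        long deadline = System.currentTimeMillis() + 2_000L;
        // Succeeds on the third attempt, well within the deadline.
        boolean ok = retryUntilDeadline(() -> ++attempts[0] >= 3, deadline, 10L);
        System.out.println(ok + " after " + attempts[0] + " attempts");
    }
}
```

The key design point discussed above is that the number of attempts is unbounded; only elapsed time ends the loop.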
Also, do we have new unit test coverage for the changes?
Updated. Call for review @abbccdda
We did not really change much, only switching from …. The only other thing is the new error code, and I just added 2 unit tests for this case.
Is this part of the initiative to throw a different exception? Could we update the summary of this PR?
Could we update the meta comment to explain when a TaskAssignmentException is thrown?
TaskAssignmentException is not a checked exception and it's just a courtesy declaration... (I can also remove throws TaskAssignmentException if you prefer.)
I don't see any need to document anything further -- the code makes it clear. In fact, this method does not even throw the exception itself; it just bubbles up from internalTopicManager.makeReady, and the code documents itself:
https://github.com/apache/kafka/pull/9060/files#diff-d3963e433c59b08688bb4481faa20e97R179-R184
@abbccdda I updated the PR. I also realized that it might be better to throw a …
Since we could throw different exceptions here, it would be good to add a log to indicate which type of exception is thrown.
There are already corresponding log.error statements before those exceptions are thrown. No need to double log, IMHO?
- part of KIP-572 - replace `retries` in InternalTopicManager with infinite retries plus a new timeout, based on consumer config MAX_POLL_INTERVAL_MS
// need to add mandatory configs; otherwise `QuietConsumerConfig` throws
consumerConfig.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
consumerConfig.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
retryTimeoutMs = new QuietConsumerConfig(consumerConfig).getInt(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG) / 2L;
Hey @mjsax, am I reading this PR correctly? Do we now only allow a single member to retry topic creation/validation for up to half of the poll interval, after which we shut down the entire application? That sounds like the opposite of resiliency... what if the brokers are temporarily unavailable? Before this, we would just let the single thread die, and the internal topic creation/validation would be retried on the subsequent rebalance. That wasn't ideal, but given the upcoming work to allow reviving/recreating a dead thread, that seems preferable to permanently ending the application?
Sorry if I'm misreading this, was just going over all the PRs in the last month or so to produce a diff+summary of the important ones, and want to make sure I actually understand all the changes we've made
Apologies if this was touched on in the KIP, it's been a while and the discussion thread was quite long so I may have missed something there
Note that the previous default was "zero retries" and thus the new default is more resilient with a 5 minute default max.poll.interval. -- But yes, we shutdown the whole app for this case now as proposed by @guozhangwang (IIRC).
Today, since we do not have ways to partially create tasks, we'd have to create all topics to make sure all tasks are "complete" within each rebalance. If we cannot successfully create the topics within the poll.interval (i.e., we'd need to complete that rebalance within the poll.interval, and I guess halving it is to be more conservative), then killing that thread is not very useful anyway since we cannot proceed with the initializable tasks.
That being said, with the upcoming work I'd agree that just shutting down the thread and allowing users to optionally retry the rebalance with new threads would be preferable.
cc @cadonna @wcarlson5 to bring to your attention.
Yeah I think we should remove the shutdown error code in case of TimeoutException during internal topic validation before 2.7. I'll create a ticket so we don't lose track -- I think even just letting it kill the one thread is better than killing all of them
- replace `retries` in InternalTopicManager with infinite retries plus a new timeout, based on consumer config MAX_POLL_INTERVAL_MS
- don't throw StreamsException any longer (as we did when exceeding retries), but send INCOMPLETE_SOURCE_TOPIC_METADATA error code to let all instances shut down

Third PR for KIP-572 (cf #8864 and #9047)
Call for review @vvcephei