KAFKA-9274: handle TimeoutException on task reset #10000
Congratulations on PR 10,000 :)
I considered merging this method into initMetadata(), but that might conflate different code paths. We should execute this method only rarely anyway, so I don't think we need to worry about calling mainConsumer.committed twice in those rare cases.
Let me know what you think.
Hm... I'm not necessarily that concerned about calling mainConsumer.committed twice in rare cases (although maybe that would not be so good, since those rare cases happen to be the ones in which the call is more likely to time out, right?).
But personally, just coming into this code from the outside, it's super confusing to have two different methods for initializing the offsets. It seems more convoluted that way, to me. Also, maybe I am missing some context here, but why do we call initOffsetsIfNeeded from initializeIfNeeded rather than from completeRestoration in the first place? We don't need to initialize the main consumer offsets until the task transitions to running.
This if statement is not strictly required; however, it allows us to just pass null as offsetResetter in tests, so it might be worth it.
Can we just pass in a no-op lambda instead? I'd rather avoid special handling for a null input that isn't supposed to be null just so we can use null in the tests (which are therefore not realistic tests, since it should never be null, no?).
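To illustrate the no-op lambda suggestion, here is a minimal sketch. It assumes the resetter is a java.util.function.Consumer over a set of partitions; SketchTask, its log, and the String partition names are hypothetical stand-ins and not the actual StreamTask code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

public class NoOpResetterSketch {

    // Hypothetical stand-in for a task; not the real StreamTask.
    public static final class SketchTask {
        private final List<String> log = new ArrayList<>();

        // Assumption: the resetter receives the partitions whose offsets need a reset.
        public void completeRestoration(final Consumer<Set<String>> offsetResetter) {
            // No null check here: callers must always supply a resetter.
            offsetResetter.accept(Collections.singleton("topic-0"));
            log.add("metadata-initialized");
        }

        public List<String> log() {
            return log;
        }
    }

    public static void main(final String[] args) {
        final SketchTask task = new SketchTask();
        // A test that does not care about resets passes a no-op lambda instead of null.
        task.completeRestoration(partitions -> { });
        System.out.println(task.log());
    }
}
```

With this shape, production code and tests exercise the same path, and an accidental null fails fast with a NullPointerException instead of being silently skipped.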
5a457f0 to ce7eeb4
Updated this PR.
case RESTORING:
    initializeMetadata();
    resetOffsetsIfNeededAndInitializeMetadata(offsetResetter);
cool, thanks, this seems much cleaner to me
ableegoldman left a comment
Ok I think this LG -- but I'll be happy to get this soaking asap
…)" This reverts commit 0bc394c.
…he#10000) This PR was removed by accident in trunk and 2.8, bringing it back.
…) (#10372) This PR was removed by accident in trunk and 2.8, bringing it back. Co-authored-by: Matthias J. Sax <matthias@confluent.io> Reviewers: Matthias J. Sax <matthias@confluent.io>
…he#10000) (apache#10372) This PR was removed by accident in trunk and 2.8, bringing it back. Co-authored-by: Matthias J. Sax <matthias@confluent.io> Reviewers: Matthias J. Sax <matthias@confluent.io>
…) (#10374) This PR was removed by accident in trunk and 2.8, bringing it back. Co-authored-by: Matthias J. Sax <matthias@confluent.io> Reviewers: Matthias J. Sax <matthias@confluent.io>
This change moves the offset reset for the internal "main consumer", performed when we revive a corrupted task, from the "task cleanup" code path to the "task init" code path. For the init code path, we already have logic in place to handle a TimeoutException that might be thrown by the consumer#committed() method call.
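As a rough illustration of why the init code path is a convenient place for this, the sketch below shows one way an init step can tolerate a timeout from a committed-offsets lookup: the task stays uninitialized and is retried on a later pass until a deadline expires. All names here (SketchTask, SketchTimeoutException, taskTimeoutMs, the Supplier used as the lookup) are assumptions for illustration and do not reproduce the actual Kafka Streams implementation.

```java
import java.util.function.Supplier;

public class TaskInitRetrySketch {

    // Hypothetical stand-in for org.apache.kafka.common.errors.TimeoutException.
    public static class SketchTimeoutException extends RuntimeException {
        public SketchTimeoutException(final String message) {
            super(message);
        }
    }

    // Hypothetical task; the Supplier stands in for a consumer#committed() lookup.
    public static class SketchTask {
        private final Supplier<Long> committedOffsetFetcher;
        private final long taskTimeoutMs;
        private long deadlineMs = -1L;   // -1 means no timeout is currently pending
        private boolean initialized = false;
        private long position = -1L;

        public SketchTask(final Supplier<Long> committedOffsetFetcher, final long taskTimeoutMs) {
            this.committedOffsetFetcher = committedOffsetFetcher;
            this.taskTimeoutMs = taskTimeoutMs;
        }

        // Returns true once initialized, false if the caller should retry later.
        public boolean initializeIfNeeded(final long nowMs) {
            if (initialized) {
                return true;
            }
            try {
                position = committedOffsetFetcher.get();
                initialized = true;
                deadlineMs = -1L;        // success clears any pending deadline
                return true;
            } catch (final SketchTimeoutException e) {
                if (deadlineMs < 0) {
                    deadlineMs = nowMs + taskTimeoutMs;   // first timeout starts the clock
                } else if (nowMs > deadlineMs) {
                    throw new IllegalStateException(
                        "task did not initialize within " + taskTimeoutMs + " ms", e);
                }
                return false;
            }
        }

        public long position() {
            return position;
        }
    }

    public static void main(final String[] args) {
        final int[] calls = {0};
        final SketchTask task = new SketchTask(() -> {
            if (calls[0]++ < 2) {
                throw new SketchTimeoutException("committed() timed out");
            }
            return 42L;
        }, 1000L);

        long now = 0L;
        while (!task.initializeIfNeeded(now)) {
            now += 100L;   // simulated time between passes of the processing loop
        }
        // Prints: initialized at position 42 after 3 attempts
        System.out.println("initialized at position " + task.position()
            + " after " + calls[0] + " attempts");
    }
}
```

Because the retry lives in the init step, a timed-out lookup leaves the task uninitialized rather than failing during cleanup, and the caller simply tries again on its next pass.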