KAFKA-5697: issue Consumer#wakeup during Streams shutdown#4930
KAFKA-5697: issue Consumer#wakeup during Streams shutdown#4930vvcephei wants to merge 10 commits intoapache:trunkfrom vvcephei:streams-client-wakeup-on-shutdown
Conversation
|
@guozhangwang @mjsax @bbejeck , Guozhang, Jason, and I had a discussion this morning and decided to go ahead and put this change in to address the immediate issue and allow the longer-term fix (#4855 and KIP-266) to proceed at their own pace. I opted to go ahead and replace the |
guozhangwang
left a comment
There was a problem hiding this comment.
@vvcephei This is a meta comment on this approach:
Note that for the global thread, in each loop (at the beginning of the loop we check isRunning)
1.a: We may call consumer.poll once in pollAndUpdate.
1.b: We may call consumer.poll multiple times in a while (offset < highWatermark) loop.
- For the normal consumer, we only call
pollonce in each loop after checkingisRunning.
3.a For the restore consumer, we call poll() once in maybeUpdateStandbyTasks, it is covered by checking if (state == State.RUNNING && taskManager.hasStandbyRunningTasks()) again.
3.b. And we call poll() once in updateNewAndRestoringTasks, which is called at most once in the loop.
So for 1.b which is called multiple times in the loop, do we need to special handle this case?
There was a problem hiding this comment.
Do we need this constructor? Shouldn't it be auto generated if no other constructors are given?
There was a problem hiding this comment.
the private constructor in a final class actually makes it impossible to instantiate the class, ensuring it can only be used for its static methods.
This is basically a pattern for declaring a "static-methods-only" class.
There was a problem hiding this comment.
nit: rename with ConsumerUtils to be consistent with other classes?
|
@vvcephei thanks for the patch overall looks good. Test failures are relevant though. |
|
The Actually, this was never guaranteed, but I guess the specific interactions would get the right metadata actions enqueued before the call to close, which would then block on them before the assertions were made. In other words, it was always a race, but now we're certain to lose it. The only solution I could think of was to block until the processing is complete, and the only way I could think to do that is to wait for lag to hit 0. I was surprised that I couldn't get at the consumer metrics. Is this expected? I think there's some aspect of the system design I'm missing here. I just added them provisionally to the streams metrics, but I'm not sure if this is desired either (at the least, it might be kip-worthy). |
|
I think we have a similar situation in some other places as well, my proposal was to wait for the number of records being consumed from the sink topic with |
@vvcephei have you thought about this? |
|
Ah, I saw those concerns previously, but overlooked them on this latest pass. Let me think about it, @guozhangwang . |
|
Yeah, all the cases you listed are fine, except 1.b, which will just loop around and get stuck in poll again (since Wakeup isn't sticky). In the case of my other approach, async poll, it would be an infinite loop instead. I'll need some time to come up with an elegant solution. Thanks for the catch! |
|
Actually, the global state case is even more different from the other streams threads. The GlobalStreamsThread overrides By contrast, the local threads won't block the call to This is important because if we have any global state, we wouldn't be able to call This seems like a problem, so I went ahead and removed that blocking code, instead just throwing in the Global thread if there's any trouble initializing. This is worthy of special scrutiny because I'm not sure why it was set up like that to begin with. I asked @dguy , but it's night for him right now, so he hasn't had a chance to answer yet. It also turns out that GST runs into a blocking consumer call even before it calls poll, when it tries to list the partitions for its topics. The bummer is that there didn't seem to be any safe semantics that would let me to treat a Wakeup like a "no-op" response. If I say that there aren't any partitions, we'd throw an exception to the top level, which is good because that's normally a user mistake. Also, the call stack between All in all, I'm not to sure about this "solution", but after kicking around various options all day, I figured it was time to get other opinions. |
|
@bbejeck For some reason, GH wouldn't let me mention you in the last comment... What do you think about it? |
|
@guozhangwang Ok, I talked to @dguy and he confirmed that the GlobalStreamThread does need to complete initialization before everything else starts. I can add a thread to issue wakeups to the client on an interval and treat the wakeups as timeouts, checking the state and retrying until we get a connection (or until we run out of retries). OR we can fix the main threads and leave the GST blocking startup until KIP-266 goes through, since the wakeups will then be unnecessary. |
|
@vvcephei Thanks for the detailed explanation. I agree that the global thread's case is different and that's partially why we introduced So let's only fix this in main thread as for KAFKA-5697 itself and consider global thread as part of KIP-266 later. |
|
@vvcephei thanks for the details. I agree with @guozhangwang, at this point with KIP-266 looming we just change the main thread and deal with GST when we have the timeouts in place. |
|
I've just confirmed that the wait-on-metrics method is correct wrt @guozhangwang 's earlier concern that the producer needs to flush before we check the lag. This already does happen (ultimately, in |
| if (globalStreamThread != null) result.putAll(globalStreamThread.consumerMetrics()); | ||
| result.putAll(metrics.metrics()); | ||
| return Collections.unmodifiableMap(result); | ||
| } |
There was a problem hiding this comment.
Note that I'm adding all the internal consumer metrics to the map returned by metrics. Is this ok?!?!?
I think so, since it's supposed to be a "map of all metrics", so I'd argue it's not a semantic change, and all the Streams metrics will still be there, so nothing will break.
But it's clearly not what we had before, so I want to ask explicitly.
I guess the alternative for anyone, like me, who wants to get the lag metrics is to register a metric reporter for the consumers and capture the metrics that way, but I would question the existence of this method if it only returns some of the metrics and you have to implement your own reporter to capture the others...
There was a problem hiding this comment.
I'm not sure seems fine to me, but I'll let others weigh in.
There was a problem hiding this comment.
Thanks for the catch.
I agree that we should be able to return all the embedded client's metrics as well; thinking about this a bit more, we should also include the producer metrics as well.
I'll go ahead and merge this PR as is while leaving a TODO marker for @vvcephei
|
@guozhangwang @bbejeck @mjsax Call for final reviews. |
| stateRestoreListener, | ||
| streamsConfig); | ||
| streamsConfig, | ||
| new GlobalStateManagerImpl.IsRunning() { |
There was a problem hiding this comment.
nit: since this is used 4 times in the test create one instance?
| if (globalStreamThread != null) result.putAll(globalStreamThread.consumerMetrics()); | ||
| result.putAll(metrics.metrics()); | ||
| return Collections.unmodifiableMap(result); | ||
| } |
There was a problem hiding this comment.
I'm not sure seems fine to me, but I'll let others weigh in.
|
java 8 failure was: java 10 failure was |
|
@bbejeck I think that the wakeup/shutdown code still has some value in GST post-startup, but I'm also ok with ditching the code you mentioned. I don't think I'll be able to check in again by the time the others weigh in, so I'll just say that you all should feel free to wait on this until I get back, or even modify it and merge it, if you don't like the current state. I'm fine any which way. |
| if (globalStreamThread != null) result.putAll(globalStreamThread.consumerMetrics()); | ||
| result.putAll(metrics.metrics()); | ||
| return Collections.unmodifiableMap(result); | ||
| } |
There was a problem hiding this comment.
Thanks for the catch.
I agree that we should be able to return all the embedded client's metrics as well; thinking about this a bit more, we should also include the producer metrics as well.
I'll go ahead and merge this PR as is while leaving a TODO marker for @vvcephei
|
Merged to trunk. |
Wakeup consumers during shutdown to break them out of any internally blocking calls. Semantically, it should be fine to treat a WakeupException as "no work to do", which will then continue the threads' polling loops, leading them to discover that they are supposed to shut down, which they will do gracefully. The existing tests should be sufficient to verify no regressions. Author: John Roesler <john@confluent.io> Reviewers: Bill Bejeck <bbejeck@gmail.com>, Guozhang Wang <wangguoz@gmail.com> Closes apache#4930 from vvcephei/streams-client-wakeup-on-shutdown minor javadoc updates
Wakeup consumers during shutdown to break them out of any internally blocking calls.
Semantically, it should be fine to treat a WakeupException as "no work to do", which will then continue the threads' polling loops, leading them to discover that they are supposed to shut down, which they will do gracefully.
The existing tests should be sufficient to verify no regressions.
Committer Checklist (excluded from commit message)