KAFKA-7109: Close fetch sessions on close of consumer (#12590)
dajac merged 6 commits into apache:trunk
Conversation
Test failures are unrelated. @showuon this is ready for your review.
FYI reviewer
The server handles this message in FetchSession.scala at https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/FetchSession.scala#L785
FYI reviewer
This change is required since we have added a new Fetch request as part of the close sequence. Thus, for a graceful close scenario, we need to mimic a server response to that Fetch request using the test utility client.respondFrom. The response might not necessarily come from the coordinator; instead, it will come from the node associated with the fetch session (which might be a node other than the coordinator, e.g. a read replica).
FYI reviewer
It is possible that the node is no longer part of the cluster or that the connection to that node has been disconnected. In such scenarios, we don't want to try sending a final fetch request to the server. Note that the node is not necessarily the coordinator and could be another broker (such as a read replica). The process of choosing a node to establish a fetch session is determined at https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L1197-L1225
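The guard described here can be sketched roughly as follows. This is an illustrative model, not the actual `Fetcher` code: the class name, `register`, and the availability set are all hypothetical stand-ins for the real client internals.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the guard discussed above: only send a final
// close-session fetch to nodes that are still known and connected. All
// names here are hypothetical, not the real Kafka client classes.
class CloseRequestPlanner {
    private final Map<Integer, String> sessionHandlers = new HashMap<>();
    private final Set<Integer> unavailableNodes;

    CloseRequestPlanner(Set<Integer> unavailableNodes) {
        this.unavailableNodes = unavailableNodes;
    }

    // Record that a fetch session exists with the given node.
    void register(int nodeId) {
        sessionHandlers.put(nodeId, "session-for-node-" + nodeId);
    }

    // Node ids that should receive a final close-session fetch request.
    Set<Integer> nodesToNotify() {
        Set<Integer> result = new HashSet<>(sessionHandlers.keySet());
        // Skip nodes that left the cluster or whose connection dropped;
        // their sessions will simply expire server-side after the timeout.
        result.removeAll(unavailableNodes);
        return result;
    }
}
```

The key point is the subtraction step: an unreachable node is silently skipped, relying on the server-side session timeout rather than a doomed request.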
FYI reviewer
Note that this is a blocking call. It sends a LEAVE_GROUP request and waits for its completion before proceeding.
Would you mind adding that comment about the blocking nature of the close call to the source for posterity?
note: I think it also tries a blocking offset commit during the close.
yes, you are right @philipnee, and I believe the code comment above referring to "send requests to the server" covers that scenario. Do you want to suggest some action here?
nope, just wanted to clarify. thanks.
@dajac since you are working on the consumer client protocol, perhaps you may also be interested in taking a look at this PR?
showuon
left a comment
@divijvaidya , thanks for the PR. I had a look and left some comments. I need more time to get familiar with how fetchSession works; I'll review again. Thanks.
What is the person seeing this log supposed to do? If they can't do anything, it should not be a warn.
This scenario (sending a close request to the server fails) ideally should not occur. It is an error condition, but we want to ignore the error since the system can recover from it (the session would be removed on the server when it times out). In such cases of recoverable errors, I prefer to add a warn so that the user can identify it as something unexpected that occurred on the system. The action the user could take will be based on the exception trace printed here (perhaps their auth creds were incorrect?).
For the other case of log.warn a few lines below this, in the upcoming commit, I have added a suggestion to the user to increase the close timeout for KafkaConsumer.
If their credentials are incorrect, they would get other errors somewhere else, right? Generally, warn errors like this tend to generate a bunch of confusion without much benefit. In a distributed system, requests are expected to fail at times. Given that we have lived for years without closing sessions, it seems unnecessary to now warn so aggressively when it fails.
Makes sense. I changed it to info in the latest commit. I still have warn left over in the other log statement (when the close times out). I have added a suggestion to the user in the warn message to increase the timeout. Please review that log statement and let me know whether that needs to change to info too.
showuon
left a comment
Overall LGTM! Left some comments. Thanks.
Perhaps a naive question, but does the fetch request to close the session fetch any records? Or does it just close the session and return?

@dajac , good question. When the consumer is closing, it'll leave the group first, and then close the fetcher. I thought leaving the group would clear the owned partitions, but it looks like it won't. Maybe we need to update the broker side to not return records when the client is trying to close the session and not create a new one. @divijvaidya , WDYT?
No, because the fetch request's field for topic partitions is set to empty. On the server side, the server handles a close fetch message by creating a …

As explained above, both of these cases are already handled on the server by the creation of a …

@showuon @dajac Please let me know if I am missing anything here.
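For readers following along, the session-epoch rules under discussion can be modeled roughly like this. This is a simplified sketch of the behavior described for `FetchMetadata.java`, not the actual class:

```java
// Simplified model of fetch-session epochs as discussed in this thread.
// INITIAL_EPOCH (0) asks the broker to create a session, a positive epoch
// continues an existing session, and FINAL_EPOCH (-1) asks the broker to
// close it; the close request also carries an empty topic-partition list.
final class FetchSessionEpochs {
    static final int INITIAL_EPOCH = 0;
    static final int FINAL_EPOCH = -1;

    // Sketch of the wraparound rule: epochs increment but skip the two
    // reserved values 0 and -1.
    static int nextEpoch(int prevEpoch) {
        if (prevEpoch < 0) {
            return FINAL_EPOCH;            // once final, stay final
        } else if (prevEpoch == Integer.MAX_VALUE) {
            return 1;                      // wrap around, skipping 0 and -1
        } else {
            return prevEpoch + 1;
        }
    }
}
```

Under this model, a graceful consumer close reuses the existing session id but sends `FINAL_EPOCH` with no partitions, so no records can be returned.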
@divijvaidya , you're right, thanks for the update! Sorry, I only checked the broker-side implementation; although we created …

The failing tests pass locally and are unrelated to this change.
dajac
left a comment
@divijvaidya Thanks for the clarification. That seems right to me. I took a look at the PR and left a few comments/questions.
What would happen if the session handler is reused after this is called? Should we add unit tests in FetchSessionHandlerTest to be complete?
SessionHandler should not be reused here after close because:
- We drain all completed fetches before calling close of sessions. Hence, no completed fetches will use the session.
- `Fetcher` is only called from the `Consumer`. `Consumer` has single-threaded access, i.e. while it is processing the `close`, we don't expect it to poll or call `Fetcher.sendFetches`, so the session handler will not be used. The `SessionHandler` map will be cleared after the close request is sent in `Fetcher.close()`.
- We have ensured that no other thread (e.g. a FetchResponse future) can use `Fetcher` while it is being closed by acquiring a lock on `Fetcher` (at `synchronized (Fetcher.this)`) before close starts. This ensures that the sessionHandler is not called by anyone before close is complete (which should clear the sessionHandler map).
Is my understanding correct here?
Regarding the test, what kind of validation/assertion would you like to see from it? I can't think of a test that might be useful for us here.
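The synchronization argument in this thread can be sketched with a toy class. This is an illustrative model, not the real `Fetcher`; the names and the boolean flag are assumptions made for the sketch:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the locking described above: close() and the response path
// contend on the same monitor, so a response handler can never observe a
// half-cleared session-handler map or repopulate it after close.
class FetcherLockSketch {
    private final Map<Integer, String> sessionHandlers = new HashMap<>();
    private boolean closed = false;

    synchronized boolean handleResponse(int nodeId) {
        if (closed) {
            return false;                  // late responses are ignored
        }
        sessionHandlers.put(nodeId, "session-" + nodeId);
        return true;
    }

    synchronized void close() {
        closed = true;
        sessionHandlers.clear();           // cleared under the same lock
    }

    synchronized int sessionCount() {
        return sessionHandlers.size();
    }
}
```

A unit test along these lines would assert that `handleResponse` after `close` is a no-op and that the map stays empty, which is the reuse-after-close property being discussed.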
@dajac this is ready for your review, whenever you get a chance.
@dajac please take a look when you get a chance!
kirktrue
left a comment
Thanks @divijvaidya!
Thank you for adding clarity in your method naming and in the comments you added (especially in the tests), in addition to fixing the issue.
@dajac please take a look!
@ijuma would you please take a look when you get a chance?
dajac
left a comment
@divijvaidya Please excuse me for the delay on this one. I picked it up again and left a few comments/questions. I think it is too late for 3.4, but let's get it merged before the xmas holidays.
Are we losing anything by removing all those try-with-resources? It basically means that the consumer is not closed in case of an exception.
yes. try-with-resources calls KafkaConsumer.close(), which has a timeout set to 30s. These tests would wait for the default timeout to exit (close() in the tests is expected to be unsuccessful because there is no server to send the request to) and hence unnecessarily increase the execution time of these tests.
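The reason a zero timeout keeps these tests fast can be sketched with a bounded wait. This is an assumed simplification; the real close path waits on network futures rather than a polling loop:

```java
import java.util.function.BooleanSupplier;

// Sketch: a bounded wait like the one a timed close performs. With a zero
// timeout the deadline is already past, so the call returns immediately
// instead of blocking for a long default while no server ever responds.
final class BoundedWait {
    static boolean awaitTrue(long timeoutMs, BooleanSupplier condition) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false;              // timed out; condition never held
            }
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

With `timeoutMs = 0` and a condition that can never hold (no server to respond), the method returns false at once, which is exactly why the tests switch to a zero duration.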
Yes, idempotency can be verified by two calls to close(); a third one just adds to the total running time of this test.
I have modified this test to verify using deterministic methods.
If we are doing this, could we adopt the new style directly:

```java
new Fetcher(
    a,
    b,
    ....
);
```
Sorry, I did not understand this comment. Was your suggestion about the indentation of the next line? I have reverted it to the original, i.e. 8-space indent.
mind elaborating on when the time object can be null? maybe when we inject the time object?
The close() could be called from the constructor of this class itself when an exception is thrown before the field this.time is initialized. I could avoid this by assigning the field first in the constructor, but that would couple the close method to the initialization order of fields in the constructor. Instead, to be on the safer side, a better approach IMO is to handle the null case explicitly.

I have added a comment explaining when it could be null in the upcoming commit.
How about Objects.requireNonNull(time) in the constructor? So that we could avoid the null handling.
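The failure mode being discussed can be seen in a small sketch. The names here are hypothetical (the real class is KafkaConsumer with its Time field); the point is only the ordering of constructor failure versus cleanup:

```java
// Sketch of the scenario discussed above, with hypothetical names: if the
// constructor throws before `time` is assigned, a cleanup call to close()
// must tolerate the null field. (The alternative, Objects.requireNonNull in
// the constructor, couples close() to field-initialization order.)
class ConsumerSketch {
    private Object time;              // stands in for the Time utility

    ConsumerSketch(boolean failBeforeTimeAssigned) {
        if (failBeforeTimeAssigned) {
            close();                  // cleanup path: `time` is still null here
            throw new IllegalStateException("construction failed");
        }
        this.time = new Object();
    }

    // Null-tolerant close: skip the timed wait when `time` was never set.
    boolean close() {
        return time != null;          // true if a timed close was possible
    }
}
```

The sketch shows why the null check in close() is reachable at all: the cleanup call runs with the field still unset.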
@dajac @philipnee please review again (and restart the tests) when you get a chance! Thank you. The test failures are unrelated: unit tests are successful in my local environment, and the failing integration tests are known flaky tests.
Thanks @divijvaidya - I don't have more questions regarding this PR.
@dajac , do you have any other comments?
dajac
left a comment
@divijvaidya Thanks for the patch, and my apologies for the delay on it. I had very limited time for reviews. Overall, the patch LGTM. I left a few nits. We should be able to merge it once they are addressed.
dajac
left a comment
@divijvaidya Thanks for the update. I left a few more minor comments.
```java
try {
    code.run();
} catch (Throwable t) {
    log.warn("{} error", what, t);
}
```
Should this be an error instead of a warn, to be consistent with closeQuietly? Moreover, I wonder if we could improve the error message. We would get something like `fetcher close error`, which is not really in line with what we usually log. For instance, closeQuietly would log something like `Failed to close fetch...`. Do you have any thoughts on this?
Fair point. Though not all "swallows" will require the error level, so I created a generic logging function based on CoreUtils.swallow and then used error for this specific instance of closing the fetcher and coordinator. I have also updated the comment.

Let me know if that looks right. Happy to change it further.
dajac
left a comment
LGTM, thanks for the patch @divijvaidya!
Thank you @dajac for your patience through this PR. It took a long time, but it would definitely improve consumers! Cheers!
@divijvaidya Sorry again for the long delay. By the way, I was wondering if we should also do this in the AbstractFetcherThread in order to close sessions used by replication when a broker shuts down. I haven't looked into it, but that may be an interesting improvement as well.

Thanks for the suggestion. I will look into it.
(#13248) I noticed this issue when tracing #12590. StreamThread closes the consumer before changing state to DEAD. If the partition rebalance happens quickly, the other StreamThreads can't change the KafkaStream state from REBALANCING to RUNNING, since there is a PENDING_SHUTDOWN StreamThread. Reviewers: Guozhang Wang <wangguoz@gmail.com>
https://issues.apache.org/jira/browse/KAFKA-15619 Deleted topics come back again in an Apache Spark structured streaming stress test after upgrading Kafka from 3.4.0 to 3.5.0; the related ticket is https://issues.apache.org/jira/browse/SPARK-45529 . The test randomly starts/stops/adds data/adds partitions/deletes topics/adds topics/checks the result in a loop, and I finally found that a deleted topic will come back again after some time. By constantly resetting the head of branch-3.5 and using … I haven't gone through the details of the PR; do you have any ideas @divijvaidya @dajac @showuon ?

@dengziming , let's discuss it in KAFKA-15619.
Problem
When a consumer is closed, fetch sessions associated with the consumer should notify the server of their intention to close, using a Fetch request with epoch = -1 (identified by `FINAL_EPOCH` in `FetchMetadata.java`). However, we are not sending this final fetch request in the current flow, which leads to unnecessary fetch sessions on the server that are closed only after a timeout.

Changes
- Changed `close()` in `Fetcher` to add logic to send the final Fetch request notifying the server of the close.
- Changed `close()` in `Consumer` to respect the timeout duration passed to it. Prior to this change, the timeout parameter was being ignored.
- Changed tests to use `Duration.ZERO` to reduce their execution time. Otherwise the tests would wait for the default timeout to exit (`close()` in the tests is expected to be unsuccessful because there is no server to send the request to).
- Renamed the `nextCloseExisting` function to `nextCloseExistingAttemptNew`.

Testing
Added unit test which validates that the correct close request is sent to the server.
Note that this change was attempted in #5407, but that PR was abandoned.