KAFKA-7109: Close fetch sessions on close of consumer (#12590)
dajac merged 6 commits into apache:trunk
Conversation
Test failures are unrelated. @showuon this is ready for your review.
FYI reviewer
The server handles this message in FetchSession.scala at https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/FetchSession.scala#L785
FYI reviewer
This change is required since we have added a new Fetch request as part of the close sequence. Thus, for a graceful close scenario, we need to mimic a server response to that Fetch request using the test utility client.respondFrom. The response might not necessarily come from the coordinator; instead, it will come from the node associated with the fetch session (which might be a node other than the coordinator, e.g. a read replica).
FYI reviewer
It is possible that the node is no longer part of the cluster or that the connection to that node has been disconnected. In such scenarios, we don't want to try sending a final fetch request to the server. Note that the node is not necessarily the coordinator and could be another broker (such as a read replica). The process of choosing a node to establish a fetch session is determined at https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L1197-L1225
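The guard described here can be sketched roughly as follows. This is an illustrative model, not the actual `Fetcher` code: the class name, `register`, and the availability set are all hypothetical stand-ins for the real client internals.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the guard discussed above: only send a final
// close-session fetch to nodes that are still known and connected. All
// names here are hypothetical, not the real Kafka client classes.
class CloseRequestPlanner {
    private final Map<Integer, String> sessionHandlers = new HashMap<>();
    private final Set<Integer> unavailableNodes;

    CloseRequestPlanner(Set<Integer> unavailableNodes) {
        this.unavailableNodes = unavailableNodes;
    }

    // Record that a fetch session exists with the given node.
    void register(int nodeId) {
        sessionHandlers.put(nodeId, "session-for-node-" + nodeId);
    }

    // Node ids that should receive a final close-session fetch request.
    Set<Integer> nodesToNotify() {
        Set<Integer> result = new HashSet<>(sessionHandlers.keySet());
        // Skip nodes that left the cluster or whose connection dropped;
        // their sessions will simply expire server-side after the timeout.
        result.removeAll(unavailableNodes);
        return result;
    }
}
```

The key point is the subtraction step: an unreachable node is silently skipped, relying on the server-side session timeout rather than a doomed request.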
FYI reviewer
Note that this is a blocking call. It sends a LEAVE_GROUP request and waits for its completion before proceeding.
Would you mind adding that comment about the blocking nature of the close call to the source for posterity?
note: I think it also tries a blocking offset commit during the close.
yes, you are right @philipnee, and I believe the code comment above referring to "send requests to the server" covers that scenario. Do you want to suggest some action here?
nope, just wanted to clarify. thanks.
@dajac since you are working on the consumer client protocol, perhaps you may also be interested in taking a look at this PR?
showuon
left a comment
@divijvaidya , thanks for the PR. I had a look and left some comments. I need more time to get familiar with how fetchSession works; I'll review again. Thanks.
What is the person seeing this log supposed to do? If they can't do anything, it should not be a warn.
This scenario (sending a close request to the server fails) ideally should not occur. It is an error condition, but we want to ignore the error since the system can recover from it (the session would be removed on the server when it times out). In such cases of recoverable errors, I prefer to add a warn so that the user can identify it as something unexpected that occurred on the system. The action the user could take will be based on the exception trace printed here (perhaps their auth creds were incorrect?).
For the other case of log.warn a few lines below this, in the upcoming commit, I have added a suggestion to the user to increase the close timeout for KafkaConsumer.
If their credentials are incorrect, they would get other errors somewhere else, right? Generally, warn errors like this tend to generate a bunch of confusion without much benefit. In a distributed system, requests are expected to fail at times. Given that we have lived for years without closing sessions, it seems unnecessary to now warn so aggressively when it fails.
Makes sense. I changed it to info in the latest commit. I still have warn left over in the other log statement (when the close times out). I have added a suggestion to the user in the warn message to increase the timeout. Please review that log statement and let me know whether that needs to change to info too.
showuon
left a comment
Overall LGTM! Left some comments. Thanks.
Perhaps a naive question, but does the fetch request to close the session fetch any records? Or does it just close the session and return?

@dajac , good question. When the consumer is closing, it'll leave the group first, and then close the fetcher. I thought leaving the group would clear the owned partitions, but it looks like it won't. Maybe we need to update the broker side to not return records when the client is trying to close the session and not create a new one. @divijvaidya , WDYT?
No, because the fetch request's field for topic partitions is set to empty. On the server side, the server handles a close fetch message by creating a …

As explained above, both of these cases are already handled on the server by the creation of a …

@showuon @dajac Please let me know if I am missing anything here.
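For readers following along, the session-epoch rules under discussion can be modeled roughly like this. This is a simplified sketch of the behavior described for `FetchMetadata.java`, not the actual class:

```java
// Simplified model of fetch-session epochs as discussed in this thread.
// INITIAL_EPOCH (0) asks the broker to create a session, a positive epoch
// continues an existing session, and FINAL_EPOCH (-1) asks the broker to
// close it; the close request also carries an empty topic-partition list.
final class FetchSessionEpochs {
    static final int INITIAL_EPOCH = 0;
    static final int FINAL_EPOCH = -1;

    // Sketch of the wraparound rule: epochs increment but skip the two
    // reserved values 0 and -1.
    static int nextEpoch(int prevEpoch) {
        if (prevEpoch < 0) {
            return FINAL_EPOCH;            // once final, stay final
        } else if (prevEpoch == Integer.MAX_VALUE) {
            return 1;                      // wrap around, skipping 0 and -1
        } else {
            return prevEpoch + 1;
        }
    }
}
```

Under this model, a graceful consumer close reuses the existing session id but sends `FINAL_EPOCH` with no partitions, so no records can be returned.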
@divijvaidya , you're right, thanks for the update! Sorry, I only checked the broker-side implementation; although we created …

The failing tests pass locally and are unrelated to this change.
dajac
left a comment
@divijvaidya Thanks for the clarification. That seems right to me. I took a look at the PR and left a few comments/questions.
What would happen if the session handler is reused after this is called? Should we add unit tests in FetchSessionHandlerTest to be complete?
SessionHandler should not be reused here after close because:
- We drain all completed fetches before calling close of sessions. Hence, no completed fetches will use the session.
- `Fetcher` is only called from the `Consumer`. `Consumer` has single-threaded access, i.e. while it is processing the `close`, we don't expect it to poll or call `Fetcher.sendFetches`, so the session handler will not be used. The `SessionHandler` map will be cleared after the close request is sent in `Fetcher.close()`.
- We have ensured that no other thread (e.g. a FetchResponse future) can use `Fetcher` while it is being closed by acquiring a lock on `Fetcher` (at `synchronized (Fetcher.this)`) before close starts. This ensures that the sessionHandler is not called by anyone before close is complete (which should clear the sessionHandler map).
Is my understanding correct here?
Regarding the test, what kind of validation/assertion would you like to see from it? I can't think of a test that might be useful for us here.
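The synchronization argument in this thread can be sketched with a toy class. This is an illustrative model, not the real `Fetcher`; the names and the boolean flag are assumptions made for the sketch:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the locking described above: close() and the response path
// contend on the same monitor, so a response handler can never observe a
// half-cleared session-handler map or repopulate it after close.
class FetcherLockSketch {
    private final Map<Integer, String> sessionHandlers = new HashMap<>();
    private boolean closed = false;

    synchronized boolean handleResponse(int nodeId) {
        if (closed) {
            return false;                  // late responses are ignored
        }
        sessionHandlers.put(nodeId, "session-" + nodeId);
        return true;
    }

    synchronized void close() {
        closed = true;
        sessionHandlers.clear();           // cleared under the same lock
    }

    synchronized int sessionCount() {
        return sessionHandlers.size();
    }
}
```

A unit test along these lines would assert that `handleResponse` after `close` is a no-op and that the map stays empty, which is the reuse-after-close property being discussed.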
@dajac this is ready for your review, whenever you get a chance.
@dajac please take a look when you get a chance!
kirktrue
left a comment
Thanks @divijvaidya!
Thank you for adding clarity in your method naming and in the comments you added (especially in the tests), in addition to fixing the issue.
@dajac please take a look!
@ijuma would you please take a look when you get a chance?
dajac
left a comment
@divijvaidya Please excuse me for the delay on this one. I picked it up again and left a few comments/questions. I think it is too late for 3.4, but let's get it merged before the xmas holidays.
Are we losing anything by removing all those try-with-resources? It basically means that the consumer is not closed in case of an exception.
yes. try-with-resources calls KafkaConsumer.close(), which has a timeout set to 30s. These tests would wait for the default timeout to exit (close() in the tests is expected to be unsuccessful because there is no server to send the request to) and hence unnecessarily increase the execution time of these tests.
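The reason a zero timeout keeps these tests fast can be sketched with a bounded wait. This is an assumed simplification; the real close path waits on network futures rather than a polling loop:

```java
import java.util.function.BooleanSupplier;

// Sketch: a bounded wait like the one a timed close performs. With a zero
// timeout the deadline is already past, so the call returns immediately
// instead of blocking for a long default while no server ever responds.
final class BoundedWait {
    static boolean awaitTrue(long timeoutMs, BooleanSupplier condition) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false;              // timed out; condition never held
            }
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

With `timeoutMs = 0` and a condition that can never hold (no server to respond), the method returns false at once, which is exactly why the tests switch to a zero duration.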
Yes, idempotency can be verified by two calls to close(); a third one just adds to the total running time of this test.
I have modified this test to verify using deterministic methods.
If we are doing this, could we adopt the new style directly:

```java
new Fetcher(
    a,
    b,
    ....
);
```
Sorry, I did not understand this comment. Was your suggestion about the indentation of the next line? I have reverted it to the original, i.e. 8-space indent.
mind elaborating on when the time object can be null? maybe when we inject the time object?
The close() could be called from the constructor of this class itself when an exception is thrown before the field this.time is initialized. I could avoid this by assigning the field first in the constructor, but that would couple the close method to the initialization order of fields in the constructor. Instead, to be on the safer side, a better approach IMO is to handle the null case explicitly.

I have added a comment explaining when it could be null in the upcoming commit.
How about Objects.requireNonNull(time) in the constructor? So that we could avoid the null handling.
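The failure mode being discussed can be seen in a small sketch. The names here are hypothetical (the real class is KafkaConsumer with its Time field); the point is only the ordering of constructor failure versus cleanup:

```java
// Sketch of the scenario discussed above, with hypothetical names: if the
// constructor throws before `time` is assigned, a cleanup call to close()
// must tolerate the null field. (The alternative, Objects.requireNonNull in
// the constructor, couples close() to field-initialization order.)
class ConsumerSketch {
    private Object time;              // stands in for the Time utility

    ConsumerSketch(boolean failBeforeTimeAssigned) {
        if (failBeforeTimeAssigned) {
            close();                  // cleanup path: `time` is still null here
            throw new IllegalStateException("construction failed");
        }
        this.time = new Object();
    }

    // Null-tolerant close: skip the timed wait when `time` was never set.
    boolean close() {
        return time != null;          // true if a timed close was possible
    }
}
```

The sketch shows why the null check in close() is reachable at all: the cleanup call runs with the field still unset.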
@dajac @philipnee please review again (and restart the tests) when you get a chance! Thank you. The test failures are unrelated: unit tests are successful in my local environment, and the failing integration tests are known flaky tests.
Thanks @divijvaidya - I don't have more questions regarding this PR.
@dajac , do you have any other comments?
dajac
left a comment
@divijvaidya Thanks for the patch, and my apologies for the delay on it. I had very limited time for reviews. Overall, the patch LGTM. I left a few nits. We should be able to merge it once they are addressed.
dajac
left a comment
@divijvaidya Thanks for the update. I left a few more minor comments.
```java
try {
    code.run();
} catch (Throwable t) {
    log.warn("{} error", what, t);
}
```
Should this be an error instead of a warn, to be consistent with closeQuietly? Moreover, I wonder if we could improve the error message. We would get something like `fetcher close error`, which is not really in line with what we usually log. For instance, closeQuietly would log something like `Failed to close fetch...`. Do you have any thoughts on this?
Fair point. Though not all "swallows" will require the error level, so I created a generic logging function based on CoreUtils.swallow and then used error for this specific instance of closing the fetcher and coordinator. I have also updated the comment.

Let me know if that looks right. Happy to change it further.
dajac
left a comment
LGTM, thanks for the patch @divijvaidya!
Thank you @dajac for your patience through this PR. It took a long time, but it would definitely improve consumers! Cheers!
@divijvaidya Sorry again for the long delay. By the way, I was wondering if we should also do this in the AbstractFetcherThread in order to close sessions used by replication when a broker shuts down. I haven't looked into it, but that may be an interesting improvement as well.

Thanks for the suggestion. I will look into it.
(#13248) I noticed this issue when tracing #12590. StreamThread closes the consumer before changing state to DEAD. If the partition rebalance happens quickly, the other StreamThreads can't change the KafkaStream state from REBALANCING to RUNNING, since there is a PENDING_SHUTDOWN StreamThread. Reviewers: Guozhang Wang <wangguoz@gmail.com>
https://issues.apache.org/jira/browse/KAFKA-15619 Deleted topics come back again in an Apache Spark structured streaming stress test after upgrading Kafka from 3.4.0 to 3.5.0; the related ticket is https://issues.apache.org/jira/browse/SPARK-45529 . The test randomly starts/stops/adds data/adds partitions/deletes topics/adds topics/checks the result in a loop, and I finally found that a deleted topic will come back again after some time. By constantly resetting the head of branch-3.5 and using … I haven't gone through the details of the PR; do you have any ideas @divijvaidya @dajac @showuon ?

@dengziming , let's discuss it in KAFKA-15619.
Problem
When a consumer is closed, fetch sessions associated with the consumer should notify the server of their intention to close, using a Fetch request with epoch = -1 (identified by `FINAL_EPOCH` in `FetchMetadata.java`). However, we are not sending this final fetch request in the current flow, which leads to unnecessary fetch sessions on the server that are closed only after a timeout.

Changes
- Changed `close()` in `Fetcher` to add logic to send the final Fetch request notifying the server of the close.
- Changed `close()` in `Consumer` to respect the timeout duration passed to it. Prior to this change, the timeout parameter was being ignored.
- Changed tests to use `Duration.ZERO` to reduce their execution time. Otherwise the tests would wait for the default timeout to exit (`close()` in the tests is expected to be unsuccessful because there is no server to send the request to).
- Renamed the `nextCloseExisting` function to `nextCloseExistingAttemptNew`.

Testing
Added unit test which validates that the correct close request is sent to the server.
Note that this change was attempted in #5407, but that PR was abandoned.