
KAFKA-7109: Close cached fetch sessions in the broker on consumer close #5407

Closed

stanislavkozlovski wants to merge 8 commits into apache:trunk from stanislavkozlovski:KAFKA-7109-close-fetch-sessions-on-consumer-close


Conversation

@stanislavkozlovski
Contributor

Previously, the consumer's incremental fetch sessions in the broker would linger until they timed out after the consumer was gone. This patch has the consumer explicitly close its cached fetch sessions when it is closed.

@stanislavkozlovski
Contributor Author

@cmccabe please review. There's no easy way to test this as far as I'm aware; the integration tests should at least verify it doesn't break anything.

Comment thread on clients/src/main/java/org/apache/kafka/clients/FetchSessionHandler.java (Outdated)
@ijuma
Member

ijuma commented Jul 25, 2018

cc @cmccabe

Contributor

@cmccabe cmccabe Jul 25, 2018

We should also clear the sessionHandlers map here, just for tidiness.

@cmccabe
Contributor

cmccabe commented Jul 25, 2018

Thanks for this, @stanislavkozlovski. Looks good.

We need a test, though. Take a look at clients/src/test/java/org/apache/kafka/clients/consumer/internals/FetcherTest.java. You should be able to write something that verifies that closing the Fetcher does the expected thing.

@stanislavkozlovski
Contributor Author

Initially I couldn't find a way to test this in FetcherTest.java but after your nudge I looked into it harder and found a way. Thanks @cmccabe.

Contributor

Should we also check that there are no partitions contained in the request?

Contributor

Also we should match the fetch session with the earlier one

Contributor Author

Absolutely. Done

@cmccabe
Contributor

cmccabe commented Jul 28, 2018

LGTM. Thanks, @stanislavkozlovski.

@stanislavkozlovski
Contributor Author

Retest this please

@stanislavkozlovski
Contributor Author

Retest this please

Member

Don't we have to do something with the futures returned by send? How do we know if the requests completed? Also, cc @hachikuji.

Contributor Author

We had discussed offline that there is no need to wait and ensure the requests completed. It makes sense to make it best effort; otherwise it can slow down client termination, and I assume people prefer that the client terminates quickly rather than guaranteeing no unused sessions remain in the broker.

Member

We should always document things like that so that people reading the code are aware of it. My point was more subtle, though: how do we make sure we actually sent the request?

Member

More specifically, there's no point in having this if we close the network client before we even get to the point where the request has reached the network layer.

Contributor Author

Oh, I did not know that was a possibility. I'll dive in. Fair point about the documentation.

Contributor

It probably makes sense to wait a short time, but not very long, to send the request to the broker. I'm not sure how that interacts with the rest of the client close logic.

Contributor Author

Yeah, good catch @ijuma. This is the code for closing the KafkaConsumer:

ClientUtils.closeQuietly(fetcher, "fetcher", firstException);
ClientUtils.closeQuietly(interceptors, "consumer interceptors", firstException);
ClientUtils.closeQuietly(metrics, "consumer metrics", firstException);
ClientUtils.closeQuietly(client, "consumer network client", firstException);
ClientUtils.closeQuietly(keyDeserializer, "consumer key deserializer", firstException);
ClientUtils.closeQuietly(valueDeserializer, "consumer value deserializer", firstException);

Meaning there's a pretty big chance the client is closed before the requests are sent.

But then I read the Javadoc of #send()

Send a new request. Note that the request is not actually transmitted on the
network until one of the {@link #poll(long)} variants is invoked. At this
point the request will either be transmitted successfully or will fail.
Use the returned future to obtain the result of the send. Note that there is no
need to check for disconnects explicitly on the {@link ClientResponse} object;
instead, the future will be failed with a {@link DisconnectException}.

Meaning this code doesn't transmit any requests since poll is not called at all...

I see two approaches to this:

  1. Call poll(long timeout) with the configured request.timeout.ms after the for loop. This gives all requests a shared window of timeout time to be sent, and blocks for at most request.timeout.ms (the default of 30 seconds seems high for this use case). Maybe a total timeout of N seconds, where N is the number of requests?
  2. Call poll(RequestFuture<?> future, long timeout) individually for each close request. This blocks on each call, so it's questionable what the timeout should be (1 second?).

In both cases, we will slow down the closing of the consumer for a bit, as we will have to block as far as I can tell.

As a sidenote, this code would also need to handle exceptions, which it currently doesn't. My idea is to only catch and log the error.
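The timing logic behind both options can be sketched as a shared close budget: each request may block briefly, but the combined wall-clock time never exceeds the timeout the caller handed to close(). This is an illustrative sketch only; CloseBudget and its method names are hypothetical, not part of the Kafka client API:

```java
// Hypothetical sketch of a shared time budget for sending close requests.
// Each request may wait, but the combined wall-clock time across all
// requests never exceeds the caller's overall timeout.
class CloseBudget {
    private final long deadlineMs;

    CloseBudget(long nowMs, long timeoutMs) {
        this.deadlineMs = nowMs + timeoutMs;
    }

    // Remaining budget, clamped at zero so callers never pass a
    // negative timeout to a blocking poll call.
    long remainingMs(long nowMs) {
        return Math.max(0L, deadlineMs - nowMs);
    }

    // Per-request wait: the smaller of the request timeout and
    // whatever is left of the overall budget.
    long perRequestWaitMs(long nowMs, long requestTimeoutMs) {
        return Math.min(requestTimeoutMs, remainingMs(nowMs));
    }
}
```

With option 2, each blocking poll on a close-request future would then be bounded both per request and overall, and the clamp at zero also covers the negative-timeout case raised later in this thread.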

Contributor Author

@cmccabe @ijuma what do you believe is the proper approach?

Contributor

The consumer close code needs to block until the fetch sessions are closed. KafkaConsumer has close() and close(long timeout, TimeUnit timeUnit). In the case of the latter, we don't want to wait longer than the specified time.

@stanislavkozlovski
Contributor Author

Retest this please

Contributor Author

I wonder if we should handle cases where this is less than 0.

Member

Can we not use the Timer that we pass to the Coordinator?

Member

We should consider using the Timer abstraction in the Fetcher methods too.

Contributor Author

Yes, agreed

Contributor Author

Although if we use the same Timer we ought to make the coordinator close with a timer of timeoutMs, not the min of timeoutMs and requestTimeoutMs, right?

@stanislavkozlovski
Contributor Author

@cmccabe I've addressed the latest changes and rebased onto trunk.

Member

Can we not use closeQuietly?

Contributor Author

I don't think so, since we won't be able to pass the Timer to fetcher.close. Maybe we could introduce a new interface called CloseableWithTimeout/TimeoutCloseable and have closeQuietly handle that as well?

Member

Or have one that takes a lambda so that the caller can do the close. Similar to what we have for Scala.
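A lambda-taking variant along the lines suggested might look roughly like this. This is a sketch under assumptions: CloseUtils, CloseAction, and the firstException handling are hypothetical names for illustration, not the actual ClientUtils API:

```java
import java.util.concurrent.atomic.AtomicReference;

class CloseUtils {
    // Functional close action, so the caller can capture extra arguments
    // (for example, a Timer) in the lambda body.
    @FunctionalInterface
    interface CloseAction {
        void close() throws Exception;
    }

    // Runs the close action, recording only the first failure instead of
    // throwing; a real implementation would also log 'name' on failure.
    static void closeQuietly(CloseAction action, String name,
                             AtomicReference<Throwable> firstException) {
        try {
            action.close();
        } catch (Throwable t) {
            firstException.compareAndSet(null, t);
        }
    }
}
```

The consumer could then call something like `closeQuietly(() -> fetcher.close(timer), "fetcher", firstException)` without adding a new Closeable-with-timeout interface.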


Member

Why is the Math.min not done any longer?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I asked in #5407 (comment)
Since the Timer will be shared across multiple close() methods, having it as short as one requestTimeoutMs may not give fetcher.close() sufficient time. Do you think it's okay to use a shared timer with the minimum for both close calls?

Member

We could update the timer so that the min is requestTimeoutMs for the second close too.

Contributor Author

In the end I think the timer for the second close should be the minimum of the time left out of timeoutMs and requestTimeoutMs.

@stanislavkozlovski
Contributor Author

Retest this please

@stanislavkozlovski
Contributor Author

Retest this please

@ijuma
Member

ijuma commented Aug 26, 2019

It may be worth reviving this; it requires a rebase.

@showuon
Member

showuon commented Feb 4, 2022

@stanislavkozlovski, I encountered similar errors and agree we should close the incremental fetch sessions when the consumer is closed. Are you able to complete this PR? If not, I can co-author with you to help complete it and get this good improvement merged. Thank you.

@showuon
Member

showuon commented Feb 8, 2022

@stanislavkozlovski, I'm going to take over this ticket tomorrow if you don't mind. Thank you.

@stanislavkozlovski
Contributor Author

@showuon Yes, please feel free. Apologies for being slow to respond and for missing the previous messages about reviving this. I will go ahead and close this PR now.

@showuon
Member

showuon commented Feb 8, 2022

Thank you!
