
KAFKA-7941: Catch TimeoutException in KafkaBasedLog worker thread#6283

Merged
rhauch merged 5 commits into apache:trunk from pgwhalen:kafka-based-log-timeout-exception-fix
Aug 13, 2019

Conversation

@pgwhalen
Contributor

  • When calling readLogToEnd(), the KafkaBasedLog worker thread should
    catch TimeoutException and log a warning, which can occur if brokers
    are unavailable, otherwise the worker thread terminates.
  • Includes an enhancement to MockConsumer that allows simulating
    exceptions not just when polling but also when querying for offsets,
    which is necessary for testing the fix.
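The fix boils down to a catch-and-continue pattern in the worker loop. The sketch below is a self-contained illustration of that pattern, not the real Kafka code: `TimeoutException` and `readToLogEnd()` here are stand-ins for `org.apache.kafka.common.errors.TimeoutException` and the actual `KafkaBasedLog` method. A timeout is logged as a warning and the loop retries, instead of the exception escaping and killing the thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class WorkerLoopSketch {
    // Stand-in for org.apache.kafka.common.errors.TimeoutException.
    static class TimeoutException extends RuntimeException {
        TimeoutException(String m) { super(m); }
    }

    static final AtomicInteger attempts = new AtomicInteger();

    // Simulates readToLogEnd(): times out on the first attempt, then succeeds.
    static void readToLogEnd() {
        if (attempts.incrementAndGet() < 2)
            throw new TimeoutException("broker unavailable");
    }

    public static void main(String[] args) {
        boolean done = false;
        while (!done) {
            try {
                readToLogEnd();
                done = true;  // reached the end of the log
            } catch (TimeoutException e) {
                // Before the fix, this exception escaped and terminated the
                // worker thread; with the fix, we warn and loop to retry.
                System.out.println("WARN Timeout while reading log to end: " + e.getMessage());
            }
        }
        System.out.println("worker thread still alive after " + attempts.get() + " attempts");
    }
}
```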

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@pgwhalen
Contributor Author

This is my first contribution and I haven't exactly pored over the entirety of the Connect runtime, so I'll be curious to see if I'm missing something entirely.

A few things I'd like reviewers to note that I thought about myself:

  • Is WARN an appropriate log level, or should there even be a log at all?
  • Given that the problem is essentially "uncaught exception causes important thread to terminate," should we "catch and continue" for yet more exceptions to prevent similar issues?
  • More broadly: is there a way to alert on the overall health of Connect, perhaps through the REST interface? Maybe the biggest issue with the bug was that it was hard to detect: tasks still reported themselves as "running," meaning the only way to discover the worker thread had died was by monitoring the expected output of our connectors (and noticing that there was none).

@pgwhalen
Contributor Author

retest this please

@pgwhalen
Contributor Author

Ping @mjsax @kkonstantine @wicknicks @rhauch (not totally sure, but I've seen these folks reviewing other connect PRs)

Contributor

@rhauch rhauch left a comment


Thanks, @pgwhalen -- nice catch! I think this was not considered an issue because the internal poll(Integer.MAX_VALUE) call in readToEnd() should never time out. But as you pointed out, there are other method invocations, and timeouts on those should be handled as well.

I have a minor suggestion to help make the log message more readable, and (if you're in the code) another to move the two setter methods on MockConsumer together. Otherwise, this looks great.

readToLogEnd();
log.trace("Finished read to end log for topic {}", topic);
} catch (TimeoutException e) {
log.warn("Timeout while reading log to end {}", e.getMessage());
Contributor


How about putting the topic name in the message (like on line 315), denoting that this will be retried automatically, and maybe why this might happen.

Suggested change
log.warn("Timeout while reading log to end {}", e.getMessage());
log.warn("Timeout while reading log to end for topic '{}'. Retrying automatically. This may occur when brokers are unavailable or unreachable.", topic, e.getMessage());

Contributor Author


Good idea, though it looks like this suggested change will exclude the message (it doesn't have another {}); I'll fix that.
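For illustration: SLF4J-style "{}" substitution fills placeholders left to right and silently ignores surplus arguments, which is why the one-placeholder message above would drop e.getMessage(). The minimal stand-in formatter below (not SLF4J itself) reproduces that behavior:

```java
public class PlaceholderDemo {
    // Minimal stand-in for SLF4J's "{}" substitution: placeholders are
    // filled left to right; surplus arguments are silently ignored.
    static String format(String template, Object... args) {
        StringBuilder sb = new StringBuilder();
        int from = 0, i = 0, idx;
        while ((idx = template.indexOf("{}", from)) >= 0 && i < args.length) {
            sb.append(template, from, idx).append(args[i++]);
            from = idx + 2;
        }
        sb.append(template.substring(from));
        return sb.toString();
    }

    public static void main(String[] args) {
        // One placeholder, two arguments: the exception message is lost.
        System.out.println(format("Timeout reading log to end for topic '{}'.",
                "connect-offsets", "request timed out"));
        // Two placeholders: both the topic and the reason appear.
        System.out.println(format("Timeout reading log to end for topic '{}'. Reason: {}",
                "connect-offsets", "request timed out"));
    }
}
```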

public synchronized void setOffsetsException(KafkaException exception) {
this.offsetsException = exception;
}

Contributor


Nit: It's probably a bit better to have this be near the setPollException(...) method.

Contributor Author


Agreed, not sure why I put it there.

@rhauch
Contributor

rhauch commented May 16, 2019

It is a bit surprising it hasn't come up more since AK 2.0 was released, but it would only occur when brokers become unavailable. It looks like KAFKA-6608 changed the consumer.endOffsets(...) methods in AK 2.0 to add an optional timeout parameter, and made the existing method KafkaBasedLog was using default to the value set by the request.timeout.ms consumer (client) property, which itself defaults to 30 seconds. This added the possibility of a TimeoutException on these methods, which didn't exist before AK 2.0.

I documented this on KAFKA-7941, but one workaround is to set the request.timeout.ms property in the Connect worker's configuration, which affects the worker's consumer used for offsets and other internal topics. Note that doing this will affect the producers for internal topics, too, and unless it's overridden via the worker's producer.request.timeout.ms or consumer.request.timeout.ms properties, it will also apply to producers and consumers used for connectors.
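A sketch of that workaround in the worker properties file (the property names are standard Connect/consumer configs; the values here are purely illustrative):

```properties
# Connect worker config: raise the request timeout used by the worker's
# internal clients for the offsets/config/status topics (value illustrative).
request.timeout.ms=120000

# Optional: pin the connector-level clients back to the default so the
# larger worker timeout does not leak into connectors' producers/consumers.
producer.request.timeout.ms=30000
consumer.request.timeout.ms=30000
```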

This fix should be ported back to AK 2.0.

 - Enhance logging on TimeoutException
 - Move setOffsetsException method to sensible place in MockConsumer
@pgwhalen
Contributor Author

Thanks for picking up the review @rhauch !

I'm surprised it hasn't come up more as well, actually. My team experiences it enough that we went to the trouble of adding some additional out-of-process monitoring specifically to catch this error, and we now restart our Connect instance nightly to get it back into a good state when it hits this. Regardless, glad we're fixing it now!

For backporting to 2.0, is that something I should help with by making other PRs, or will you/others handle it while merging?

@rhauch
Contributor

rhauch commented May 23, 2019

@pgwhalen, my comment about backporting was for posterity. PRs should be based on trunk (like this one), and committers will backport to older branches when merging.

Contributor

@rhauch rhauch left a comment


Thanks, @pgwhalen. One more suggested change, and after that I think we're ready to merge.

}

public synchronized void setException(KafkaException exception) {
this.exception = exception;
Contributor


@pgwhalen, after thinking about this some more, I think we should keep this older signature, deprecate it, and have it call setPollException(exception). Although MockConsumer is strictly speaking a non-public API, doing this will help avoid problems should people use the MockConsumer class in their tests. And it'd probably help to add JavaDoc that says to use setPollException(...) instead.
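The backward-compatible pattern being suggested, deprecate the old method and delegate to the new one, can be sketched as follows. This is an illustrative class, not the real MockConsumer; only the method names setException and setPollException follow the PR:

```java
public class MockConsumerSketch {
    private RuntimeException pollException;

    /**
     * @deprecated Use {@link #setPollException(RuntimeException)} instead.
     */
    @Deprecated
    public synchronized void setException(RuntimeException exception) {
        // Delegate so existing callers of the old API keep working.
        setPollException(exception);
    }

    public synchronized void setPollException(RuntimeException exception) {
        this.pollException = exception;
    }

    public synchronized RuntimeException pollException() {
        return pollException;
    }
}
```

Callers compiled against the old signature continue to work and get a deprecation warning pointing them at the replacement.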

Contributor Author


Good point, I forgot it was a mildly public API. I added it back, should be good to merge now unless anything else stands out.

Contributor

@wicknicks wicknicks left a comment


thanks for the fix, @pgwhalen. one question.

try {
readToLogEnd();
log.trace("Finished read to end log for topic {}", topic);
} catch (TimeoutException e) {
Contributor


there are a couple of other places where we call readToEnd() and poll() (links below). should we catch this exception in those places also?

Contributor Author

@pgwhalen pgwhalen May 25, 2019


Thanks for the additional eyes @wicknicks !

The readToLogEnd() call is inside start(), so my thinking was that failure would happen on startup, which seems okay (if not desirable). If there were broker availability issues, it seems reasonable to me for Connect not to start. If we want to be more robust we could come up with some retrying strategy, but I bet that would require changes in a number of other places.

The poll() method contains its own catch (KafkaException e), and KafkaException is a superclass of TimeoutException, so we ought to be okay there.
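The reason the existing handler suffices is basic exception-hierarchy behavior: a catch clause for a superclass also handles its subclasses. A self-contained demo with stand-in classes (the real hierarchy is org.apache.kafka.common.errors.TimeoutException extends KafkaException):

```java
public class CatchHierarchyDemo {
    // Stand-ins mirroring the Kafka exception hierarchy:
    // TimeoutException extends KafkaException extends RuntimeException.
    static class KafkaException extends RuntimeException {
        KafkaException(String m) { super(m); }
    }
    static class TimeoutException extends KafkaException {
        TimeoutException(String m) { super(m); }
    }

    static boolean caughtBySuperclassHandler() {
        try {
            throw new TimeoutException("simulated timeout");
        } catch (KafkaException e) {
            // The superclass catch clause handles the subclass too, which is
            // why poll()'s existing catch (KafkaException e) covers timeouts.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("caught: " + caughtBySuperclassHandler());
    }
}
```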

Contributor

@wicknicks wicknicks left a comment


thanks, @pgwhalen! lgtm.

@jgoldrich

👍 this affects us, please merge!

@adithyant

can we please merge this ?

Contributor

@rhauch rhauch left a comment


Thanks, @pgwhalen. This looks good to me, but since we want to backport this all the way back to the 2.0 branch, I've been waiting to merge until we get out of the freeze on the 2.3 branch.

At least approving this. I'll merge this and backport it as soon as we have a vote to release AK 2.3.0.

@reshetu

reshetu commented Jul 18, 2019

Wondering what is the timeline for this to be integrated/backported into the production release. It's affecting our current project. Sorry if this is not the right place to ask this question.

@pgwhalen
Contributor Author

Giving this one another bump as 2.3 has been out for a bit now.

} catch (TimeoutException e) {
log.warn("Timeout while reading log to end for topic '{}'. Retrying automatically. " +
"This may occur when brokers are unavailable or unreachable. Reason: {}", topic, e.getMessage());
continue;
Contributor


I would suggest backing off before retrying, since the timeout probably indicates the resource is unavailable (it might be the Kafka broker itself, the network, or something else); an immediate retry is unlikely to succeed.

Contributor Author


In practice, actual retries to the network will still be beholden to the request.timeout.ms and retry.backoff.ms inside the consumer, so that isn't an issue here.

Contributor


That's true, thanks for refreshing my mind. Then no need to backoff here.

@rhauch rhauch merged commit e2c8612 into apache:trunk Aug 13, 2019
rhauch pushed a commit that referenced this pull request Aug 13, 2019

When calling readLogToEnd(), the KafkaBasedLog worker thread should catch TimeoutException and log a warning, which can occur if brokers are unavailable, otherwise the worker thread terminates.

Includes an enhancement to MockConsumer that allows simulating exceptions not just when polling but also when querying for offsets, which is necessary for testing the fix.

Author: Paul Whalen <pgwhalen@gmail.com>
Reviewers: Randall Hauch <rhauch@gmail.com>, Arjun Satish <arjun@confluent.io>, Ryanne Dolan <ryannedolan@gmail.com>
@qinghui-xu
Contributor

Thanks for the contribution and the effort of merging it! @pgwhalen @rhauch
I still wonder if my suggestion here could be applied (https://issues.apache.org/jira/browse/KAFKA-8790)? Basically, it's a patch based on this PR, with a worker shutdown mechanism for the case where the error is unrecoverable (or retries continuously fail).

What are your thoughts? I think it will be especially helpful when the workers' lifecycle is managed by a resource orchestrator (such as Mesos, Kubernetes, or similar), which is our case.
We would be glad to contribute if you think it's valuable for you as well.

xiowu0 pushed a commit to linkedin/kafka that referenced this pull request Aug 22, 2019
…aBasedLog worker thread (apache#6283)

TICKET = KAFKA-7941
LI_DESCRIPTION =
EXIT_CRITERIA = HASH [2426926]
ORIGINAL_DESCRIPTION =

When calling readLogToEnd(), the KafkaBasedLog worker thread should catch TimeoutException and log a warning, which can occur if brokers are unavailable, otherwise the worker thread terminates.

Includes an enhancement to MockConsumer that allows simulating exceptions not just when polling but also when querying for offsets, which is necessary for testing the fix.

Author: Paul Whalen <pgwhalen@gmail.com>
Reviewers: Randall Hauch <rhauch@gmail.com>, Arjun Satish <arjun@confluent.io>, Ryanne Dolan <ryannedolan@gmail.com>
(cherry picked from commit 2426926)
9 participants