
KAFKA-7941: Catch TimeoutException in KafkaBasedLog worker thread#6283

Merged
rhauch merged 5 commits into apache:trunk from pgwhalen:kafka-based-log-timeout-exception-fix
Aug 13, 2019

Conversation

@pgwhalen
Contributor

  • When calling readLogToEnd(), the KafkaBasedLog worker thread should
    catch TimeoutException and log a warning, which can occur if brokers
    are unavailable, otherwise the worker thread terminates.
  • Includes an enhancement to MockConsumer that allows simulating
    exceptions not just when polling but also when querying for offsets,
    which is necessary for testing the fix.
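The fix boils down to a catch-and-continue pattern in the worker loop. The sketch below is a self-contained illustration of that pattern, not the real Kafka code: `TimeoutException` and `readToLogEnd()` here are stand-ins for `org.apache.kafka.common.errors.TimeoutException` and the actual `KafkaBasedLog` method. A timeout is logged as a warning and the loop retries, instead of the exception escaping and killing the thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class WorkerLoopSketch {
    // Stand-in for org.apache.kafka.common.errors.TimeoutException.
    static class TimeoutException extends RuntimeException {
        TimeoutException(String m) { super(m); }
    }

    static final AtomicInteger attempts = new AtomicInteger();

    // Simulates readToLogEnd(): times out on the first attempt, then succeeds.
    static void readToLogEnd() {
        if (attempts.incrementAndGet() < 2)
            throw new TimeoutException("broker unavailable");
    }

    public static void main(String[] args) {
        boolean done = false;
        while (!done) {
            try {
                readToLogEnd();
                done = true;  // reached the end of the log
            } catch (TimeoutException e) {
                // Before the fix, this exception escaped and terminated the
                // worker thread; with the fix, we warn and loop to retry.
                System.out.println("WARN Timeout while reading log to end: " + e.getMessage());
            }
        }
        System.out.println("worker thread still alive after " + attempts.get() + " attempts");
    }
}
```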

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@pgwhalen
Contributor Author

This is my first contribution and I haven't exactly pored over the entirety of the Connect runtime, so I'll be curious to see if I'm missing something entirely.

A few things I'd like reviewers to note that I thought about myself:

  • Is WARN an appropriate log level, or should there even be a log at all?
  • Given that the problem is essentially "uncaught exception causes important thread to terminate," should we "catch and continue" for yet more exceptions to prevent similar issues?
  • More broadly: is there a way to alert on the overall health of Connect, perhaps through the REST interface? Maybe the biggest issue with the bug was that it was hard to detect: tasks still reported themselves as "running," meaning the only way to discover the worker thread had died was by monitoring the expected output of our connectors (and noticing that there was none).

@pgwhalen
Contributor Author

retest this please

@pgwhalen
Contributor Author

Ping @mjsax @kkonstantine @wicknicks @rhauch (not totally sure, but I've seen these folks reviewing other connect PRs)

Contributor

@rhauch rhauch left a comment


Thanks, @pgwhalen -- nice catch! I think this was not considered an issue because the internal poll(Integer.MAX_VALUE) call in readToEnd() should never time out. But as you pointed out, there are other method invocations, and timeouts on those should be handled as well.

I have a minor suggestion to help make the log message more readable, and (if you're in the code) another to move the two setter methods on MockConsumer together. Otherwise, this looks great.

readToLogEnd();
log.trace("Finished read to end log for topic {}", topic);
} catch (TimeoutException e) {
log.warn("Timeout while reading log to end {}", e.getMessage());
Contributor


How about putting the topic name in the message (like on line 315), denoting that this will be retried automatically, and maybe why this might happen.

Suggested change
log.warn("Timeout while reading log to end {}", e.getMessage());
log.warn("Timeout while reading log to end for topic '{}'. Retrying automatically. This may occur when brokers are unavailable or unreachable.", topic, e.getMessage());

Contributor Author


Good idea, though it looks like this suggested change will exclude the message (it doesn't have another {}); I'll fix that.
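For illustration: SLF4J-style "{}" substitution fills placeholders left to right and silently ignores surplus arguments, which is why the one-placeholder message above would drop e.getMessage(). The minimal stand-in formatter below (not SLF4J itself) reproduces that behavior:

```java
public class PlaceholderDemo {
    // Minimal stand-in for SLF4J's "{}" substitution: placeholders are
    // filled left to right; surplus arguments are silently ignored.
    static String format(String template, Object... args) {
        StringBuilder sb = new StringBuilder();
        int from = 0, i = 0, idx;
        while ((idx = template.indexOf("{}", from)) >= 0 && i < args.length) {
            sb.append(template, from, idx).append(args[i++]);
            from = idx + 2;
        }
        sb.append(template.substring(from));
        return sb.toString();
    }

    public static void main(String[] args) {
        // One placeholder, two arguments: the exception message is lost.
        System.out.println(format("Timeout reading log to end for topic '{}'.",
                "connect-offsets", "request timed out"));
        // Two placeholders: both the topic and the reason appear.
        System.out.println(format("Timeout reading log to end for topic '{}'. Reason: {}",
                "connect-offsets", "request timed out"));
    }
}
```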

public synchronized void setOffsetsException(KafkaException exception) {
this.offsetsException = exception;
}

Contributor


Nit: It's probably a bit better to have this be near the setPollException(...) method.

Contributor Author


Agreed, not sure why I put it there.

@rhauch
Contributor

rhauch commented May 16, 2019

It is a bit surprising it hasn't come up more since AK 2.0 was released, but it would only occur when brokers become unavailable. It looks like KAFKA-6608 changed the consumer.endOffsets(...) methods in AK 2.0 to add an optional timeout parameter, and made the existing method KafkaBasedLog was using default to the value set by the request.timeout.ms consumer (client) property, which itself defaults to 30 seconds. This added the possibility of a TimeoutException on these methods, which didn't exist before AK 2.0.

I documented this on KAFKA-7941, but one workaround is to set the request.timeout.ms property in the Connect worker's configuration, which affects the worker's consumer used for offsets and other internal topics. Note that doing this will affect the producers for internal topics, too, and unless it's overridden via the worker's producer.request.timeout.ms or consumer.request.timeout.ms properties, it will also apply to producers and consumers used for connectors.
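A sketch of that workaround in the worker properties file (the property names are standard Connect/consumer configs; the values here are purely illustrative):

```properties
# Connect worker config: raise the request timeout used by the worker's
# internal clients for the offsets/config/status topics (value illustrative).
request.timeout.ms=120000

# Optional: pin the connector-level clients back to the default so the
# larger worker timeout does not leak into connectors' producers/consumers.
producer.request.timeout.ms=30000
consumer.request.timeout.ms=30000
```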

This fix should be ported back to AK 2.0.

 - Enhance logging on TimeoutException
 - Move setOffsetsException method to sensible place in MockConsumer
@pgwhalen
Contributor Author

Thanks for picking up the review @rhauch !

I'm surprised it hasn't come up more as well, actually. My team experiences it enough that we went to the trouble of adding some additional out-of-process monitoring specifically to catch this error, and we now restart our Connect instance nightly to get it back into a good state when it hits this. Regardless, glad we're fixing it now!

For backporting to 2.0, is that something I should help with by making other PRs, or will you/others handle it while merging?

@rhauch
Contributor

rhauch commented May 23, 2019

@pgwhalen, my comment about backporting was for posterity. PRs should be based on trunk (like this one), and committers will backport to older branches when merging.

Contributor

@rhauch rhauch left a comment


Thanks, @pgwhalen. One more suggested change, and after that I think we're ready to merge.

}

public synchronized void setException(KafkaException exception) {
this.exception = exception;
Contributor


@pgwhalen, after thinking about this some more, I think we should keep this older signature, deprecate it, and have it call setPollException(exception). Although MockConsumer is strictly speaking a non-public API, doing this will help avoid problems should people use the MockConsumer class in their tests. And it'd probably help to add JavaDoc that says to use setPollException(...) instead.
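The backward-compatible pattern being suggested, deprecate the old method and delegate to the new one, can be sketched as follows. This is an illustrative class, not the real MockConsumer; only the method names setException and setPollException follow the PR:

```java
public class MockConsumerSketch {
    private RuntimeException pollException;

    /**
     * @deprecated Use {@link #setPollException(RuntimeException)} instead.
     */
    @Deprecated
    public synchronized void setException(RuntimeException exception) {
        // Delegate so existing callers of the old API keep working.
        setPollException(exception);
    }

    public synchronized void setPollException(RuntimeException exception) {
        this.pollException = exception;
    }

    public synchronized RuntimeException pollException() {
        return pollException;
    }
}
```

Callers compiled against the old signature continue to work and get a deprecation warning pointing them at the replacement.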

Contributor Author


Good point, I forgot it was a mildly public API. I added it back, should be good to merge now unless anything else stands out.

Contributor

@wicknicks wicknicks left a comment


thanks for the fix, @pgwhalen. one question.

try {
readToLogEnd();
log.trace("Finished read to end log for topic {}", topic);
} catch (TimeoutException e) {
Contributor


there are a couple of other places where we call readToEnd() and poll() (links below). should we catch this exception in those places also?

Contributor Author

@pgwhalen pgwhalen May 25, 2019


Thanks for the additional eyes @wicknicks !

The readToLogEnd() call is inside start(), so my thinking was that failure would happen on startup, which seems okay (if not desirable). If there were broker availability issues, it seems reasonable to me for Connect not to start. If we want to be more robust we could come up with some retrying strategy, but I bet that would require changes in a number of other places.

The poll() method contains its own catch (KafkaException e), and KafkaException is a superclass of TimeoutException, so we ought to be okay there.
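The reason the existing handler suffices is basic exception-hierarchy behavior: a catch clause for a superclass also handles its subclasses. A self-contained demo with stand-in classes (the real hierarchy is org.apache.kafka.common.errors.TimeoutException extends KafkaException):

```java
public class CatchHierarchyDemo {
    // Stand-ins mirroring the Kafka exception hierarchy:
    // TimeoutException extends KafkaException extends RuntimeException.
    static class KafkaException extends RuntimeException {
        KafkaException(String m) { super(m); }
    }
    static class TimeoutException extends KafkaException {
        TimeoutException(String m) { super(m); }
    }

    static boolean caughtBySuperclassHandler() {
        try {
            throw new TimeoutException("simulated timeout");
        } catch (KafkaException e) {
            // The superclass catch clause handles the subclass too, which is
            // why poll()'s existing catch (KafkaException e) covers timeouts.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("caught: " + caughtBySuperclassHandler());
    }
}
```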

Contributor

@wicknicks wicknicks left a comment


thanks, @pgwhalen! lgtm.

@jgoldrich

👍 this affects us, please merge!

@adithyant

can we please merge this ?

Contributor

@rhauch rhauch left a comment


Thanks, @pgwhalen. This looks good to me, but since we want to backport this all the way back to the 2.0 branch, I've been waiting to merge until we get out of the freeze on the 2.3 branch.

At least approving this. I'll merge this and backport it as soon as we have a vote to release AK 2.3.0.

@reshetu

reshetu commented Jul 18, 2019

Wondering what is the timeline for this to be integrated/backported into the production release. It's affecting our current project. Sorry if this is not the right place to ask this question.

@pgwhalen
Contributor Author

Giving this one another bump as 2.3 has been out for a bit now.

} catch (TimeoutException e) {
log.warn("Timeout while reading log to end for topic '{}'. Retrying automatically. " +
"This may occur when brokers are unavailable or unreachable. Reason: {}", topic, e.getMessage());
continue;
Contributor


I would suggest backing off before retrying, since the timeout probably indicates the resource is unavailable (it might be the Kafka broker itself, the network, or something else); an immediate retry is unlikely to succeed.

Contributor Author


In practice, actual retries to the network will still be beholden to the request.timeout.ms and retry.backoff.ms inside the consumer, so that isn't an issue here.

Contributor


That's true, thanks for refreshing my mind. Then no need to backoff here.

@rhauch rhauch merged commit e2c8612 into apache:trunk Aug 13, 2019
rhauch pushed a commit that referenced this pull request Aug 13, 2019

When calling readLogToEnd(), the KafkaBasedLog worker thread should catch TimeoutException and log a warning, which can occur if brokers are unavailable, otherwise the worker thread terminates.

Includes an enhancement to MockConsumer that allows simulating exceptions not just when polling but also when querying for offsets, which is necessary for testing the fix.

Author: Paul Whalen <pgwhalen@gmail.com>
Reviewers: Randall Hauch <rhauch@gmail.com>, Arjun Satish <arjun@confluent.io>, Ryanne Dolan <ryannedolan@gmail.com>
@qinghui-xu
Contributor

Thanks for the contribution and the effort of merging it! @pgwhalen @rhauch
I still wonder if my suggestion here could be applied (https://issues.apache.org/jira/browse/KAFKA-8790)? Basically, it's a patch based on this PR, with a worker shutdown mechanism for the case where the error is unrecoverable (or retries continuously fail).

What are your thoughts? I think it will be especially helpful when the workers' lifecycle is managed by a resource orchestrator (such as Mesos, Kubernetes, or similar), which is our case.
We would be glad to contribute if you think it's valuable for you as well.

xiowu0 pushed a commit to linkedin/kafka that referenced this pull request Aug 22, 2019
…aBasedLog worker thread (apache#6283)

TICKET = KAFKA-7941
LI_DESCRIPTION =
EXIT_CRITERIA = HASH [2426926]
ORIGINAL_DESCRIPTION =

When calling readLogToEnd(), the KafkaBasedLog worker thread should catch TimeoutException and log a warning, which can occur if brokers are unavailable, otherwise the worker thread terminates.

Includes an enhancement to MockConsumer that allows simulating exceptions not just when polling but also when querying for offsets, which is necessary for testing the fix.

Author: Paul Whalen <pgwhalen@gmail.com>
Reviewers: Randall Hauch <rhauch@gmail.com>, Arjun Satish <arjun@confluent.io>, Ryanne Dolan <ryannedolan@gmail.com>
(cherry picked from commit 2426926)
9 participants