
KAFKA-6446: KafkaProducer should use timed version of await to avoid endless waiting#4563

Merged
hachikuji merged 9 commits into apache:trunk from huxihx:KAFKA-6446
Mar 27, 2018
Conversation

@huxihx
Contributor

@huxihx huxihx commented Feb 13, 2018

https://issues.apache.org/jira/browse/KAFKA-6446

Replaced await() with its timed version to avoid endless waiting, and refined the code so that the Sender thread can exit instead of endlessly trying to connect to the bad broker.


Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

…trap server is down

https://issues.apache.org/jira/browse/KAFKA-6446

Replaced await() with its timed version to avoid endless waiting, and refined the code so that the Sender thread can exit instead of endlessly trying to connect to the `bad` broker.
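The core of the change can be sketched with a plain CountDownLatch standing in for Kafka's TransactionalRequestResult (the class name and method below are illustrative, not the actual producer internals):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the fix: a CountDownLatch emulates TransactionalRequestResult,
// which is completed when the broker's response arrives.
public class TimedAwaitSketch {

    // Before the patch: result.await() blocks forever if the broker never
    // responds. After: a timed await turns the hang into an exception.
    static void awaitInit(CountDownLatch result, long maxBlockMs)
            throws InterruptedException, TimeoutException {
        if (!result.await(maxBlockMs, TimeUnit.MILLISECONDS))
            throw new TimeoutException(
                "Timeout expired while initializing transactional state in " + maxBlockMs + " ms.");
    }
}
```

With the untimed await, a producer pointed at an unreachable bootstrap server would block in initTransactions forever; the timed variant bounds the wait.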
@huxihx
Contributor Author

huxihx commented Feb 13, 2018

@apurvam Please kindly review. Thanks.

sender.wakeup();
result.await();
try {
if (!result.await(requestTimeoutMs, TimeUnit.MILLISECONDS)) {
Contributor

I think we should also transition to a fatal error state if we can't init successfully. This will ensure that the only other operation you can do is close. We should also add a test case to ensure that operations other than close are not allowed if initTransactions failed.

Contributor Author

@apurvam Do you mean we should throw a fatal error instead of a TimeoutException?

Contributor

We should probably use the producer's max block time instead of the request timeout.

@huxihx
Contributor Author

huxihx commented Mar 2, 2018

@apurvam Thanks for the comments. Please review again.

@huxihx
Contributor Author

huxihx commented Mar 5, 2018

@becketqin Please take some time to review this patch. Thanks.

@hachikuji hachikuji self-assigned this Mar 6, 2018
@huxihx
Contributor Author

huxihx commented Mar 9, 2018

@hachikuji Could you help review this patch? Thanks.

Contributor

@hachikuji hachikuji left a comment

Thanks for the patch. Left a few comments.

sender.wakeup();
result.await();
try {
if (!result.await(requestTimeoutMs, TimeUnit.MILLISECONDS)) {
Contributor

We should probably use the producer's max block time instead of the request timeout.

if (!result.await(requestTimeoutMs, TimeUnit.MILLISECONDS)) {
transactionManager.transitionToFatalError(
new TimeoutException("Timeout expired while initializing the transaction in " + requestTimeoutMs + "ms."));
throw new FatalExitError();
Contributor

I don't think this is what we want. We're not using an Error anywhere else in the producer. I'd suggest we just throw TimeoutException, but it is a RetriableException, which would be misleading if we do not allow retrying. We could either introduce a FatalTimeoutException, or we could try to make this API safe to retry. For example, to implement the latter, we could cache the result object so that on retry, we continue waiting for it.

Contributor Author

IMO, the retry here should be implemented by the users instead of the producer itself, right? If that's the case, the user can simply call initTransactions again after catching the thrown TimeoutException. That way we could avoid caching the result object. Does that make sense?

Contributor

Yes, I am suggesting that we allow the user to retry after a timeout. The simplest way to do so is to cache the result object so that we do not send another InitProducerId request. Instead, we should just continue waiting on the one that we already sent.
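The caching idea can be sketched like this (hypothetical names, a CountDownLatch standing in for TransactionalRequestResult; this is not Kafka's actual implementation): the pending result lives in a field, so a retry after a timeout keeps waiting on the same object instead of sending a second InitProducerId request.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of caching the pending result inside the manager.
public class PendingResultCache {
    private CountDownLatch pendingResult; // stands in for TransactionalRequestResult

    // The first call creates (and would enqueue) the request; a retry after
    // a timeout reuses the same pending result rather than sending again.
    public synchronized CountDownLatch initializeTransactions() {
        if (pendingResult == null)
            pendingResult = new CountDownLatch(1);
        return pendingResult;
    }

    // Called when the InitProducerId response finally arrives.
    public synchronized void complete() {
        if (pendingResult != null)
            pendingResult.countDown();
    }

    public boolean awaitResult(long timeoutMs) throws InterruptedException {
        return initializeTransactions().await(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```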

Contributor Author

Please confirm: if the retry happens outside initTransactions, then in order to return the cached result we would have to change the signature of this method, which would need a formal approval process. However, if we do the retry from within initTransactions, we have to pick the total retry count ahead of time, which also needs a formal discussion. Am I correct?

Contributor

Let me clarify what I meant. In TransactionManager.initializeTransactions, we return a TransactionalRequestResult, which we wait on from initTransactions(). What I am suggesting is that we could cache the instance of TransactionalRequestResult inside TransactionManager; if initTransactions() times out and is invoked again, we can just continue waiting on the same result object. So it does not change the API.

try {
if (!result.await(requestTimeoutMs, TimeUnit.MILLISECONDS)) {
transactionManager.transitionToFatalError(
new TimeoutException("Timeout expired while initializing the transaction in " + requestTimeoutMs + "ms."));
Contributor

This should probably say "Timeout expired while initializing transactional state ..."

client.send(clientRequest, now);
return true;
}
break; // break the loop if we failed to find a specific node
Contributor

I am not sure we always want to do break if we cannot find a node to send the request to. Other than the bootstrapping case, we typically just want to refresh metadata and try again. I think I'd suggest we leave this logic as is, but add a condition to the loop to check for producer shutdown.
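The suggested loop shape can be sketched as follows (hypothetical names, not the actual Sender code): keep refreshing metadata and retrying when no node is found, but also check a shutdown flag so the thread can still exit.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Illustrative sketch: retry with metadata refresh instead of breaking,
// while still honoring producer shutdown.
public class SenderLoopSketch {
    private final AtomicBoolean running = new AtomicBoolean(true);

    // nodeFinder stands in for locating the target broker/coordinator;
    // it returns null while no node is reachable.
    public boolean trySend(Supplier<String> nodeFinder) {
        while (running.get()) {          // exits when the producer shuts down
            String node = nodeFinder.get();
            if (node != null) {
                // client.send(clientRequest, now) would happen here
                return true;
            }
            refreshMetadata();           // otherwise refresh metadata and retry
        }
        return false;                    // shutdown requested before success
    }

    public void shutdown() { running.set(false); }

    private void refreshMetadata() { /* request a metadata update */ }
}
```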

@huxihx
Contributor Author

huxihx commented Mar 13, 2018

@hachikuji Please review again. Thanks.

Contributor

@hachikuji hachikuji left a comment

Thanks, one more comment.

InitProducerIdHandler handler = new InitProducerIdHandler(builder);
enqueueRequest(handler);
return handler.result;
if (transactionalRequestResult == null || transactionalRequestResult.isCompleted()) {
Contributor

In the current code, calling initializeTransactions more than once causes an illegal state. I think we should preserve that semantic in spite of retries after timeouts. In other words, the only situation where you're allowed to retry a call to initializeTransactions is after a timeout. Once it returns successfully for the user, we go back to the current behavior and raise an illegal state.

We're almost there with this patch, but the user could see the illegal state before a successful call because of the isCompleted check here. To implement the semantics we want, I think we may have to do the following:

  1. Move the caching of the initTransactions result into KafkaProducer.
  2. On the first invocation, we cache the result.
  3. If there is a timeout, we continue waiting on the result.
  4. Once the result completes successfully, we should return and set the result to null.

Once the cached result is cleared, if the user tries to initializeTransactions again, they will get the current illegal state error, which is what we want.
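The four steps above can be sketched like this (a hedged illustration, not the real KafkaProducer; a CountDownLatch stands in for the result object): a timeout leaves the cached result in place so the user may retry, success clears it, and any later call raises an illegal state again.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch of the retry-after-timeout semantics.
public class InitTransactionsSketch {
    private CountDownLatch initTransactionsResult; // cached pending result
    private boolean initialized = false;

    public void initTransactions(long maxBlockMs)
            throws InterruptedException, TimeoutException {
        CountDownLatch result;
        synchronized (this) {
            if (initialized)
                throw new IllegalStateException("initTransactions has already completed");
            if (initTransactionsResult == null)
                initTransactionsResult = new CountDownLatch(1); // first call: send request
            result = initTransactionsResult;
        }
        // Wait outside the lock so a response can complete the result.
        if (!result.await(maxBlockMs, TimeUnit.MILLISECONDS))
            throw new TimeoutException("Timeout expired while initializing transactional state");
        synchronized (this) {
            initTransactionsResult = null; // success: clear the cached result
            initialized = true;
        }
    }

    // Called when the broker's InitProducerId response arrives.
    public synchronized void completeFromBroker() {
        if (initTransactionsResult != null)
            initTransactionsResult.countDown();
    }
}
```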

Contributor Author

Moving the cached result to KafkaProducer would require changing the signature of initTransactions, which might need a KIP to be discussed. Is that true?

Contributor

What I mean is that we would create a private field for the cached result in KafkaProducer, so the signature does not change. The reason I am suggesting we move it there is that that is where we are doing the wait, which means we know when the operation has been successfully completed from the user's perspective.

Contributor

@hachikuji hachikuji left a comment

Thanks for the updates. Looking good, but had a couple more comments.

private final ProducerInterceptors<K, V> interceptors;
private final ApiVersions apiVersions;
private final TransactionManager transactionManager;
private TransactionalRequestResult transactionalRequestResult;
Contributor

nit: we only use this for initializeTransactions, so maybe the name could be more specific? Say initTransactionsResult?

* @throws org.apache.kafka.common.errors.AuthorizationException fatal error indicating that the configured
* transactional.id is not authorized. See the exception for more details
* @throws KafkaException if the producer has encountered a previous fatal error or for any other unexpected error
* @throws TimeoutException if the time taken to initialize the transaction has surpassed <code>max.block.ms</code>.
Contributor

Can you add a comment to the javadoc mentioning that this method may be retried if a TimeoutException or an InterruptException is raised?

private def createTransactionalProducerToConnectNonExistentBrokers(): KafkaProducer[Array[Byte], Array[Byte]] = {
val props = new Properties()
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "test-transaction-id")
val producer = TestUtils.createNewProducer(brokerList = "192.168.1.1:9092", maxBlockMs = 1000,
Contributor

Is there a way we can test this without depending on an IP directly like this? I don't think we can assume that the builds will always be sandboxed. I actually think KafkaProducerTest might be a better place since it lets us mock the network layer.

@huxihx
Contributor Author

huxihx commented Mar 26, 2018

retest it please

Contributor

@hachikuji hachikuji left a comment

Thanks for the patch, LGTM. I made a few minor tweaks to the new test cases so that they depended on a mocked network (and ran faster). There are a few additional test paths to hit, but they are a little harder. I will try to address them in an upcoming PR.
