Skip to content

KAFKA-8040: Streams retry initTransactions on timeout#6372

Merged
bbejeck merged 1 commit intoapache:trunkfrom
vvcephei:KAFKA-8040-streams-retry-initTransactions-2.0
Mar 8, 2019
Merged

KAFKA-8040: Streams retry initTransactions on timeout#6372
bbejeck merged 1 commit intoapache:trunkfrom
vvcephei:KAFKA-8040-streams-retry-initTransactions-2.0

Conversation

@vvcephei
Copy link
Copy Markdown
Contributor

@vvcephei vvcephei commented Mar 5, 2019

As of 2.0, Producer.initTransactions may throw a TimeoutException, which is retriable. Streams should retry instead of crashing when we encounter this exception.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the TimeoutException to the error message and as the cause of the StreamsException. This made the lines too long, so I reformatted them.

I also added some more failure modes to the log message; I felt the existing message could be misleading if the problem was actually just a network interruption.

@vvcephei
Copy link
Copy Markdown
Contributor Author

vvcephei commented Mar 5, 2019

@guozhangwang @bbejeck , Can you take a look at this if you get the chance?

https://issues.apache.org/jira/browse/KAFKA-8040

I took Guozhang's suggestion from the ticket and kicked off a rebalance if we get a TimeoutException from the init transactions call.

This is definitely safe to do, but I'm wondering if it might be too heavy weight. If something as simple as a network hiccough breaks the tcp connection to the broker, the Producer will currently just hang until the "max block" timeout expires and then throw this exception.

Under those circumstances, it would be sufficient just to try again. But in this PR, we'll actually do a rebalance, pausing processing and creating a whole bunch more broker API calls until the rebalance completes. If the cause of the timeout was an overloaded or malfunctioning broker, this could actually make the problem worse.

I'm wondering if we should instead just retry in a loop just the initTransactions call. It seems like this preserves the semantics, which is to block until the call returns.

Is there some aspect to this that I'm missing which would require a rebalance to resolve?

Thanks,
-John

EDIT: I realized later I was wrong about the rebalance. We'll actually crash. Please see my next comment.

@vvcephei
Copy link
Copy Markdown
Contributor Author

vvcephei commented Mar 5, 2019

Actually, after looking more closely at the code, I realized that the StreamsException will be fatal. It might have been a mistake, but I thought the intention was to re-try the call.

It certainly seems appropriate; TimeoutException seems like a retryable exception, and initTransactions doesn't retry internally.

WDYT?

@vvcephei
Copy link
Copy Markdown
Contributor Author

vvcephei commented Mar 5, 2019

Java 8 failures were:

  • org.apache.kafka.common.network.SslSelectorTest.testCloseConnectionInClosingState
  • kafka.api.AdminClientIntegrationTest.testConsumerGroups

@vvcephei
Copy link
Copy Markdown
Contributor Author

vvcephei commented Mar 5, 2019

Retest this, please.

Copy link
Copy Markdown
Contributor

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a unit test at StreamTask level to cover this case?

Copy link
Copy Markdown
Member

@mjsax mjsax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, we still want to die on a TimeoutException and just improve the error message?

What are the target versions for this fix? Should it go agains 2.2 and we back-port it to 2.1 and 2.0 ? I think we want a better fix for trunk ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does not sound right. If we throw a StreamsException the thread will die.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, haha, I wrote that when I was confused about what would happen. Good catch!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why one more indent ?

Copy link
Copy Markdown
Contributor Author

@vvcephei vvcephei Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just how Idea seems to format multiline expressions... e.g.,

1 +
    2

instead of

1 + 
2

Is it undesirable?

Copy link
Copy Markdown
Member

@mjsax mjsax Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not important -- just look funky to me. We concatenate multiple strings and thus all should have the same indent -- why would we indent the first differently? I would format as follows (even if I know that you don't like that the plus goes into the next line):

"string1"
+ "string2"
+ "string3"

This avoid the "ambiguity", of multiple string parameters, vs one concatenated parameter:

method(
    "param1",
    "param2",
    "param3",
    "param4");

// vs

method(
    "param-part-1" +
    "param-part-2" +
    "param-part-3",
    "new-param");

// vs

method(
    "param-part-1"
    + "param-part-2"
    + "param-part-3",
    "new-param");

Thirst and second hard hard to distinguish (where do parameters start/stop), but third makes it clear, that it's two parameters but not one or four, what is hard to tell in the middle case. Of course, double indent also fixes this but it's weird to me:

method(
    "param-part-1" +
        "param-part-2" +
        "param-part-3",
    "new-param");

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember a while back someone on the internet trying to get everyone to always put operators (and commas) at the beginning of the line for this reason. I think it makes sense. I don't think it caught on in general because it creates syntactic ambiguity in Javascript, but since Java requires semicolons to end a line, it should be fine.

Do you have your IDE set up to create this formatting? Maybe it sounds lazy, but the reason I've formatted it this way is that that's what IDEA does by default. I don't want to spend time curating the number of indent spaces by hand on every code change. I couldn't figure out how to get rid of the extra indent in the multi-line string concatenation. Eg, it even does this:

        final String s =
            "asdf"
                + "qwer"
                + "qwer";

which is like the worst outcome.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the IDE setting -- this case is rare enough that I "fix" fit manually if it happens.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you plan to address or ignore this -- I am fine either way.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I agree with you in principle, but I think I'll just leave it as-is, until I can figure out a way to get the IDE to do it for me.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: as above

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above

@vvcephei
Copy link
Copy Markdown
Contributor Author

vvcephei commented Mar 7, 2019

@mjsax , thanks for the review.

Yes, the idea is to log a more informative message, but still shut down. I was confused when I wrote the original log message.

I thought it would be easier to target 2.0 and forward-port it to 2.1, 2.2, and trunk. (the exception was introduced in 2.0)

I'm not sure if we want to do something different on trunk... We can talk about whether Streams should try to stay alive after retriable exceptions, but it's a much larger discussion. In the short term, it seems like we'd just want to merge this change to trunk as well.

Copy link
Copy Markdown
Member

@bbejeck bbejeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @vvcephei just one minor comment LGTM modulo comments from @mjsax.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: extract this expected message into a variable so the test is easier to read?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean, so that it's aligned farther to the left? That's a good suggestion.

@mjsax
Copy link
Copy Markdown
Member

mjsax commented Mar 7, 2019

I thought it would be easier to target 2.0 and forward-port it to 2.1, 2.2, and trunk. (the exception was introduced in 2.0)

Works for me. For this case, the PR should be against trunk and we cherry-pick to older branches (this is the common pattern).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: fix indention (move each parameter to own line)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above.

@bbejeck
Copy link
Copy Markdown
Member

bbejeck commented Mar 7, 2019

Java 8 passed, Java 11 failed with kafka.server.LogDirFailureTest.testIOExceptionDuringLogRoll

retest this please

@vvcephei vvcephei changed the base branch from 2.0 to trunk March 7, 2019 23:00
@vvcephei
Copy link
Copy Markdown
Contributor Author

vvcephei commented Mar 7, 2019

As requested, I've rebased this PR on trunk.

Copy link
Copy Markdown
Member

@mjsax mjsax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Can be merged if Jenkins passes.

@vvcephei
Copy link
Copy Markdown
Contributor Author

vvcephei commented Mar 8, 2019

Ok! The Jenkins tests have passed. Thanks, @mjsax , @bbejeck , and @guozhangwang !

@bbejeck bbejeck merged commit f716d08 into apache:trunk Mar 8, 2019
@bbejeck
Copy link
Copy Markdown
Member

bbejeck commented Mar 8, 2019

Merged #6372 to trunk

Thanks @vvcephei!

@vvcephei vvcephei deleted the KAFKA-8040-streams-retry-initTransactions-2.0 branch March 8, 2019 16:24
bbejeck pushed a commit that referenced this pull request Mar 8, 2019
As of 2.0, Producer.initTransactions may throw a TimeoutException, which is retriable. Streams should retry instead of crashing when we encounter this exception

Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>,  Bill Bejeck <bbejeck@gmail.com>
bbejeck pushed a commit that referenced this pull request Mar 8, 2019
As of 2.0, Producer.initTransactions may throw a TimeoutException, which is retriable. Streams should retry instead of crashing when we encounter this exception

Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>,  Bill Bejeck <bbejeck@gmail.com>
@bbejeck
Copy link
Copy Markdown
Member

bbejeck commented Mar 8, 2019

cherry-picked to 2.1 and 2.2 will merge separate PR for 2.0

jarekr pushed a commit to confluentinc/kafka that referenced this pull request Apr 18, 2019
* warn-apache-kafka/trunk: (41 commits)
  MINOR: Avoid double null check in KStream#transform() (apache#6429)
  KAFKA-7944: Improve Suppress test coverage (apache#6382)
  KAFKA-3522: add missing guards for TimestampedXxxStore (apache#6356)
  MINOR: Change Trogdor agent's cleanup executor to a cached thread pool (apache#6309)
  KAFKA-7976; Update config before notifying controller of unclean leader update (apache#6426)
  KAFKA-7801: TopicCommand should not be able to alter transaction topic partition count
  KAFKA-8091; Wait for processor shutdown before testing removed listeners (apache#6425)
  MINOR: Update delete topics zk path in assertion error messages
  KAFKA-7939: Fix timing issue in KafkaAdminClientTest.testCreateTopicsRetryBackoff
  KAFKA-7922: Return authorized operations in Metadata request response (KIP-430 Part-2)
  MINOR: Print usage when parse fails during console producer
  MINOR: fix Scala compiler warning (apache#6417)
  KAFKA-7288; Fix check in SelectorTest to wait for no buffered bytes (apache#6415)
  KAFKA-8065: restore original input record timestamp in forward() (apache#6393)
  MINOR: cleanup deprectaion annotations (apache#6290)
  KAFKA-3522: Add TimestampedWindowStore builder/runtime classes (apache#6173)
  KAFKA-8069; Fix early expiration of offsets due to invalid loading of expire timestamp (apache#6401)
  KAFKA-8070: Increase consumer startup timeout in system tests (apache#6405)
  KAFKA-8040: Streams handle initTransactions timeout (apache#6372)
  KAFKA-7980 - Fix timing issue in SocketServerTest.testConnectionRateLimit (apache#6391)
  ...
pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019
As of 2.0, Producer.initTransactions may throw a TimeoutException, which is retriable. Streams should retry instead of crashing when we encounter this exception

Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>,  Bill Bejeck <bbejeck@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants