KAFKA-8040: Streams retry initTransactions on timeout#6372
KAFKA-8040: Streams retry initTransactions on timeout#6372bbejeck merged 1 commit intoapache:trunkfrom vvcephei:KAFKA-8040-streams-retry-initTransactions-2.0
Conversation
There was a problem hiding this comment.
I added the TimeoutException to the error message and as the cause of the StreamsException. This made the lines too long, so I reformatted them.
I also added some more failure modes to the log message; I felt the existing message could be misleading if the problem was actually just a network interruption.
|
@guozhangwang @bbejeck , Can you take a look at this if you get the chance? https://issues.apache.org/jira/browse/KAFKA-8040 I took Guozhang's suggestion from the ticket and kicked off a rebalance if we get a TimeoutException from the init transactions call. This is definitely safe to do, but I'm wondering if it might be too heavy weight. If something as simple as a network hiccough breaks the tcp connection to the broker, the Producer will currently just hang until the "max block" timeout expires and then throw this exception. Under those circumstances, it would be sufficient just to try again. But in this PR, we'll actually do a rebalance, pausing processing and creating a whole bunch more broker API calls until the rebalance completes. If the cause of the timeout was an overloaded or malfunctioning broker, this could actually make the problem worse. I'm wondering if we should instead just retry in a loop just the Is there some aspect to this that I'm missing which would require a rebalance to resolve? Thanks, EDIT: I realized later I was wrong about the rebalance. We'll actually crash. Please see my next comment. |
|
Actually, after looking more closely at the code, I realized that the StreamsException will be fatal. It might have been a mistake, but I thought the intention was to re-try the call. It certainly seems appropriate; WDYT? |
|
Java 8 failures were:
|
|
Retest this, please. |
guozhangwang
left a comment
There was a problem hiding this comment.
Could we add a unit test at StreamTask level to cover this case?
mjsax
left a comment
There was a problem hiding this comment.
From my understanding, we still want to die on a TimeoutException and just improve the error message?
What are the target versions for this fix? Should it go agains 2.2 and we back-port it to 2.1 and 2.0 ? I think we want a better fix for trunk ?
There was a problem hiding this comment.
That does not sound right. If we throw a StreamsException the thread will die.
There was a problem hiding this comment.
Oh, haha, I wrote that when I was confused about what would happen. Good catch!
There was a problem hiding this comment.
It's just how Idea seems to format multiline expressions... e.g.,
1 +
2
instead of
1 +
2
Is it undesirable?
There was a problem hiding this comment.
Not important -- just look funky to me. We concatenate multiple strings and thus all should have the same indent -- why would we indent the first differently? I would format as follows (even if I know that you don't like that the plus goes into the next line):
"string1"
+ "string2"
+ "string3"
This avoid the "ambiguity", of multiple string parameters, vs one concatenated parameter:
method(
"param1",
"param2",
"param3",
"param4");
// vs
method(
"param-part-1" +
"param-part-2" +
"param-part-3",
"new-param");
// vs
method(
"param-part-1"
+ "param-part-2"
+ "param-part-3",
"new-param");
Thirst and second hard hard to distinguish (where do parameters start/stop), but third makes it clear, that it's two parameters but not one or four, what is hard to tell in the middle case. Of course, double indent also fixes this but it's weird to me:
method(
"param-part-1" +
"param-part-2" +
"param-part-3",
"new-param");
There was a problem hiding this comment.
I remember a while back someone on the internet trying to get everyone to always put operators (and commas) at the beginning of the line for this reason. I think it makes sense. I don't think it caught on in general because it creates syntactic ambiguity in Javascript, but since Java requires semicolons to end a line, it should be fine.
Do you have your IDE set up to create this formatting? Maybe it sounds lazy, but the reason I've formatted it this way is that that's what IDEA does by default. I don't want to spend time curating the number of indent spaces by hand on every code change. I couldn't figure out how to get rid of the extra indent in the multi-line string concatenation. Eg, it even does this:
final String s =
"asdf"
+ "qwer"
+ "qwer";which is like the worst outcome.
There was a problem hiding this comment.
I don't know the IDE setting -- this case is rare enough that I "fix" fit manually if it happens.
There was a problem hiding this comment.
Let me know if you plan to address or ignore this -- I am fine either way.
There was a problem hiding this comment.
Ok, I agree with you in principle, but I think I'll just leave it as-is, until I can figure out a way to get the IDE to do it for me.
|
@mjsax , thanks for the review. Yes, the idea is to log a more informative message, but still shut down. I was confused when I wrote the original log message. I thought it would be easier to target 2.0 and forward-port it to 2.1, 2.2, and trunk. (the exception was introduced in 2.0) I'm not sure if we want to do something different on trunk... We can talk about whether Streams should try to stay alive after retriable exceptions, but it's a much larger discussion. In the short term, it seems like we'd just want to merge this change to trunk as well. |
There was a problem hiding this comment.
super nit: extract this expected message into a variable so the test is easier to read?
There was a problem hiding this comment.
You mean, so that it's aligned farther to the left? That's a good suggestion.
Works for me. For this case, the PR should be against |
There was a problem hiding this comment.
nit: fix indention (move each parameter to own line)
|
Java 8 passed, Java 11 failed with retest this please |
|
As requested, I've rebased this PR on trunk. |
mjsax
left a comment
There was a problem hiding this comment.
LGTM. Can be merged if Jenkins passes.
|
Ok! The Jenkins tests have passed. Thanks, @mjsax , @bbejeck , and @guozhangwang ! |
As of 2.0, Producer.initTransactions may throw a TimeoutException, which is retriable. Streams should retry instead of crashing when we encounter this exception Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>, Bill Bejeck <bbejeck@gmail.com>
As of 2.0, Producer.initTransactions may throw a TimeoutException, which is retriable. Streams should retry instead of crashing when we encounter this exception Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>, Bill Bejeck <bbejeck@gmail.com>
|
cherry-picked to |
* warn-apache-kafka/trunk: (41 commits) MINOR: Avoid double null check in KStream#transform() (apache#6429) KAFKA-7944: Improve Suppress test coverage (apache#6382) KAFKA-3522: add missing guards for TimestampedXxxStore (apache#6356) MINOR: Change Trogdor agent's cleanup executor to a cached thread pool (apache#6309) KAFKA-7976; Update config before notifying controller of unclean leader update (apache#6426) KAFKA-7801: TopicCommand should not be able to alter transaction topic partition count KAFKA-8091; Wait for processor shutdown before testing removed listeners (apache#6425) MINOR: Update delete topics zk path in assertion error messages KAFKA-7939: Fix timing issue in KafkaAdminClientTest.testCreateTopicsRetryBackoff KAFKA-7922: Return authorized operations in Metadata request response (KIP-430 Part-2) MINOR: Print usage when parse fails during console producer MINOR: fix Scala compiler warning (apache#6417) KAFKA-7288; Fix check in SelectorTest to wait for no buffered bytes (apache#6415) KAFKA-8065: restore original input record timestamp in forward() (apache#6393) MINOR: cleanup deprectaion annotations (apache#6290) KAFKA-3522: Add TimestampedWindowStore builder/runtime classes (apache#6173) KAFKA-8069; Fix early expiration of offsets due to invalid loading of expire timestamp (apache#6401) KAFKA-8070: Increase consumer startup timeout in system tests (apache#6405) KAFKA-8040: Streams handle initTransactions timeout (apache#6372) KAFKA-7980 - Fix timing issue in SocketServerTest.testConnectionRateLimit (apache#6391) ...
As of 2.0, Producer.initTransactions may throw a TimeoutException, which is retriable. Streams should retry instead of crashing when we encounter this exception Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>, Bill Bejeck <bbejeck@gmail.com>
As of 2.0,
Producer.initTransactionsmay throw a TimeoutException, which is retriable. Streams should retry instead of crashing when we encounter this exception.Committer Checklist (excluded from commit message)