KAFKA-3013#695
Conversation
Thanks, this is super helpful! Minor improvement: the error message should probably mention that the batch expired due to a timeout while requesting metadata from the brokers, i.e., let's not make our users read the code comments to figure out what happened.
I was just thinking that "requesting metadata" might not be the only reason. Suppose we have max in-flight requests set to 1 and the connection between the leader and the producer has issues (say, a cut cable), but the broker can still communicate with the other brokers in the cluster, so the metadata refresh still returns this broker as the leader.
Could add
@MayureshGharat I am a little confused by your example. Suppose the leader can communicate with the other brokers, and the other brokers can communicate with the producer so that they can provide metadata; then, in an IP-based network, the leader should also be able to communicate with the producer, right?
Discussed with @lindong28 offline. We discussed the case where it is possible that the entire cluster is healthy but the connection between the producer and a single broker is broken.
Force-pushed from 140d89f to 930c934
@MayureshGharat Yeah, I cannot rule out the possibility, but I cannot come up with a reasonable scenario where this would happen either. It seems like an unlikely case, if not impossible. Just my opinion.
What was the conclusion regarding the message? As it is, this PR is already an improvement, so it would be good to include it. If we can be a bit more specific, even better.
@ijuma do you mean we should include the "requesting metadata" part in the message?
@MayureshGharat Yes, I was asking if there was a conclusion on whether we should include that part or not. It would be good to get @gwenshap's input since she asked for it and she would be able to merge this if she's happy with it. :)
@gwenshap: Can you take another look at the patch and the reasoning in this PR?
I disagree that we can run into this issue when metadata is available. Note that we check the request timeout against the record batch's lastAttemptMs, so as long as we are retrying, we shouldn't hit this type of timeout and expire the batch. If you want to validate this: set up an IP filter to block traffic between a producer and a single node in the cluster, and check whether you hit the "expired" error. Does that make sense?
Cool. I will update the PR. :)
Force-pushed from 930c934 to df33131
@gwenshap Could you take another look?
LGTM (well, I actually suggested a tiny code clean-up as well)
Tiny nitpick: TopicPartition.toString does what you are doing manually here, so you could simplify it by just saying:
"Batch containing " + recordCount + " record(s) expired due to timeout while requesting metadata from brokers for " + topicPartition
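The suggestion works because Java string concatenation invokes toString() on the appended object. A standalone sketch to illustrate; the TopicPartition class below is a simplified stand-in for the real org.apache.kafka.common.TopicPartition, whose toString() formats as "topic-partition":

```java
// Minimal stand-in for org.apache.kafka.common.TopicPartition, whose
// toString() renders as "<topic>-<partition>".
class TopicPartition {
    private final String topic;
    private final int partition;

    TopicPartition(String topic, int partition) {
        this.topic = topic;
        this.partition = partition;
    }

    @Override
    public String toString() {
        return topic + "-" + partition;
    }
}

public class ExpiryMessageDemo {
    public static void main(String[] args) {
        TopicPartition tp = new TopicPartition("my-topic", 3);
        int recordCount = 17;
        // Concatenation calls tp.toString() implicitly, so there is no
        // need to append the topic and partition fields by hand.
        String msg = "Batch containing " + recordCount
                + " record(s) expired due to timeout while requesting metadata from brokers for " + tp;
        System.out.println(msg);
        // prints: Batch containing 17 record(s) expired due to timeout
        // while requesting metadata from brokers for my-topic-3
    }
}
```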
@MayureshGharat, unfortunately this branch has conflicts and needs to be rebased against master. I also left a minor comment on the PR itself. If you address these, we can try to get it merged by a committer.
Force-pushed from df33131 to edc749d
@gwenshap Can you please merge this?
Done. Sorry for the delay :)
NP. Thanks :) -Mayuresh
Added topic-partition information to the exception message on batch expiry in RecordAccumulator