
KAFKA-3013 #695

Closed

MayureshGharat wants to merge 1 commit into apache:trunk from MayureshGharat:kafka-3013

Conversation

@MayureshGharat
Contributor

Added topic-partition information to the exception message on batch expiry in RecordAccumulator

@gwenshap
Contributor

Thanks, this is super helpful!

Minor improvement: the error message should probably mention that the batch expired due to a timeout while requesting metadata from brokers? I.e., let's not make our users read the code comment to figure out what happened.

@MayureshGharat
Contributor Author

I was just thinking that "requesting metadata" might not be the only reason. Suppose we have max in-flight requests set to 1 and the connection between the leader and the producer has issues (e.g., a cut cable), but the broker is still able to communicate with the other brokers in the cluster, so the metadata refresh still returns this broker as the leader.
The batches sent on the wire will be timed out and retried up to the max retries. In the meantime, the other batches in the RecordAccumulator might expire due to the timeout.
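If it helps to pin the scenario down, the setup described above corresponds roughly to a producer configured like this (a sketch; the config keys are standard Kafka producer settings, but the class and the values are illustrative):

```java
import java.util.Properties;

public class ProducerConfigSketch {
    // Hypothetical settings matching the scenario above; the keys are real
    // Kafka producer configs, the values are only illustrative.
    static Properties buildProps() {
        Properties props = new Properties();
        props.put("max.in.flight.requests.per.connection", "1"); // one request on the wire per broker
        props.put("retries", "3");                               // retry failed sends a few times
        props.put("request.timeout.ms", "30000");                // fail in-flight requests after 30s
        return props;
    }

    public static void main(String[] args) {
        System.out.println(buildProps());
    }
}
```

With one in-flight request and retries enabled, a failing broker connection keeps the sent batch cycling through retries while other batches sit in the accumulator, which is the expiry path being debated.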

@darionyaphet
Contributor

Could we add recordCount to the TimeoutException message, to tell how many records failed to be written in this batch? Thank you :)

@lindong28
Member

@MayureshGharat I am a little confused by your example. Suppose the leader can communicate with the other brokers, and the other brokers can communicate with the producer so that they can provide metadata; then, in an IP-based network, the leader should also be able to communicate with the producer, right?

@MayureshGharat
Contributor Author

Discussed with @lindong28 offline. We discussed the case where it is possible that the entire cluster is healthy but the connection between the producer and a single broker is broken.

@lindong28
Member

@MayureshGharat Yeah, I cannot rule out the possibility, but I also cannot come up with a reasonable scenario in which this would happen. It seems unlikely, if not impossible. Just my opinion.

@ijuma
Member

ijuma commented Jan 6, 2016

What was the conclusion with regards to the message? As it is, this PR is already an improvement, so it would be good to include it. If we can be a bit more specific, even better.

@MayureshGharat
Contributor Author

@ijuma do you mean we should include the "requesting metadata" part in the message?

@ijuma
Member

ijuma commented Jan 6, 2016

@MayureshGharat Yes, I was asking if there was a conclusion on whether we should include that part or not. It would be good to get @gwenshap's input since she asked for it and she would be able to merge this if she's happy with it. :)

@MayureshGharat
Contributor Author

@ijuma sure, this patch is waiting on @gwenshap's input, to see if she agrees with the explanation above. I am happy to submit another patch if required.

@MayureshGharat
Contributor Author

@gwenshap: Can you take another look at the patch and the reasoning in this PR?

@gwenshap
Contributor

I disagree that we can run into this issue when metadata is available.

Note that we are checking request timeout against lastAttemptMs of the record batch. So as long as we are retrying, we shouldn't hit this type of timeout and expire.

If you want to validate this: set up IP filters to block traffic between the producer and a single node in the cluster, and check whether you hit the "expired" error.

Does that make sense?
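The check described above can be sketched as follows (a simplified, hypothetical stand-in for the real RecordAccumulator logic; the method and variable names are illustrative):

```java
public class BatchExpirySketch {
    // Simplified version of the expiry check being discussed: a batch
    // "expires" only if request.timeout.ms has elapsed since its last send
    // attempt. A batch that is actively being retried keeps refreshing
    // lastAttemptMs, so it is not expired by this path.
    static boolean maybeExpire(long nowMs, long lastAttemptMs, long requestTimeoutMs) {
        return nowMs - lastAttemptMs > requestTimeoutMs;
    }

    public static void main(String[] args) {
        long requestTimeoutMs = 30_000;
        // A batch retried 5 seconds ago is not expired...
        System.out.println(maybeExpire(100_000, 95_000, requestTimeoutMs)); // false
        // ...but one whose last attempt was 40 seconds ago is.
        System.out.println(maybeExpire(100_000, 60_000, requestTimeoutMs)); // true
    }
}
```

This is why, as long as retries are in progress, this timeout should not fire, and the expiry path is effectively tied to waiting on metadata.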

@MayureshGharat
Contributor Author

Cool. I will update the PR. :)

@guozhangwang
Contributor

@gwenshap Could you take another look?

@ijuma
Member

ijuma commented Jan 29, 2016

LGTM (well, I actually suggested a tiny code clean-up as well)

@ijuma
Member

Tiny nitpick: TopicPartition.toString does what you are doing manually here, so you could simplify it by just saying:

"Batch containing " + recordCount + " record(s) expired due to timeout while requesting metadata from brokers for " + topicPartition
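To illustrate the suggestion, here is a minimal sketch. The nested TopicPartition below is a hypothetical stand-in for Kafka's org.apache.kafka.common.TopicPartition, whose real toString() also renders as "topic-partition":

```java
public class ExpiryMessageSketch {
    // Hypothetical stand-in for Kafka's TopicPartition, kept here so the
    // sketch is self-contained.
    record TopicPartition(String topic, int partition) {
        @Override
        public String toString() {
            return topic + "-" + partition;
        }
    }

    public static void main(String[] args) {
        TopicPartition tp = new TopicPartition("my-topic", 3);
        int recordCount = 42;
        // TopicPartition.toString() handles the topic/partition formatting,
        // so no manual concatenation of topic and partition is needed.
        String msg = "Batch containing " + recordCount
                + " record(s) expired due to timeout while requesting metadata from brokers for " + tp;
        System.out.println(msg);
        // prints: Batch containing 42 record(s) expired due to timeout
        //         while requesting metadata from brokers for my-topic-3
    }
}
```

This also addresses @darionyaphet's earlier request, since the record count is part of the message.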

@ijuma
Member

ijuma commented Feb 22, 2016

@MayureshGharat, unfortunately this branch has conflicts and it needs to be merged/rebased against master. I also left a minor comment on the PR itself. If you address these, we can try and get it merged by a committer.

@ijuma
Member

ijuma commented Mar 14, 2016

@gwenshap Can you please merge this?

@asfgit closed this in deb2b00 on Mar 15, 2016
@gwenshap
Contributor

Done. Sorry for the delay :)

@MayureshGharat
Contributor Author

NP. Thanks :)

