KAFKA-3013#695
Conversation
Thanks, this is super helpful! Minor improvement: the error message should probably mention that the batch expired due to a timeout while requesting metadata from the brokers, i.e., let's not make our users read the code comments to figure out what happened.
I was just thinking that "requesting metadata" might not be the only reason. Suppose we have max in-flight requests set to 1 and the connection between the leader and the producer has issues (say, a cut cable), but the broker can still communicate with the other brokers in the cluster, so the metadata refresh still returns this broker as the leader.
Could add
@MayureshGharat I am a little confused by your example. Suppose the leader can communicate with the other brokers, and the other brokers can communicate with the producer so that they can provide metadata; then, in an IP-based network, the leader should also be able to communicate with the producer, right?
Discussed with @lindong28 offline. We discussed the case where it is possible that the entire cluster is healthy but the connection between the producer and a single broker is broken.
Force-pushed from 140d89f to 930c934
@MayureshGharat Yeah, I cannot rule out the possibility, but I cannot come up with a reasonable scenario where this would happen either. It seems like an unlikely case, if not impossible. Just my opinion.
What was the conclusion regarding the message? As it is, this PR is already an improvement, so it would be good to include it. If we can be a bit more specific, even better.
@ijuma do you mean we should include the "requesting metadata" part in the message?
@MayureshGharat Yes, I was asking if there was a conclusion on whether we should include that part or not. It would be good to get @gwenshap's input since she asked for it and she would be able to merge this if she's happy with it. :)
@gwenshap: Can you take another look at the patch and the reasoning in this PR?
I disagree that we can run into this issue when metadata is available. Note that we check the request timeout against the record batch's lastAttemptMs, so as long as we are retrying, we shouldn't hit this type of timeout and expire the batch. If you want to validate this: set up an IP filter to block traffic between a producer and a single node in the cluster, and check whether you hit the "expired" error. Does that make sense?
Cool. I will update the PR. :)
Force-pushed from 930c934 to df33131
@gwenshap Could you take another look?
LGTM (well, I actually suggested a tiny code clean-up as well)
Tiny nitpick: TopicPartition.toString does what you are doing manually here, so you could simplify it by just saying:
"Batch containing " + recordCount + " record(s) expired due to timeout while requesting metadata from brokers for " + topicPartition
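The suggestion works because Java string concatenation invokes toString() on the appended object. A standalone sketch to illustrate; the TopicPartition class below is a simplified stand-in for the real org.apache.kafka.common.TopicPartition, whose toString() formats as "topic-partition":

```java
// Minimal stand-in for org.apache.kafka.common.TopicPartition, whose
// toString() renders as "<topic>-<partition>".
class TopicPartition {
    private final String topic;
    private final int partition;

    TopicPartition(String topic, int partition) {
        this.topic = topic;
        this.partition = partition;
    }

    @Override
    public String toString() {
        return topic + "-" + partition;
    }
}

public class ExpiryMessageDemo {
    public static void main(String[] args) {
        TopicPartition tp = new TopicPartition("my-topic", 3);
        int recordCount = 17;
        // Concatenation calls tp.toString() implicitly, so there is no
        // need to append the topic and partition fields by hand.
        String msg = "Batch containing " + recordCount
                + " record(s) expired due to timeout while requesting metadata from brokers for " + tp;
        System.out.println(msg);
        // prints: Batch containing 17 record(s) expired due to timeout
        // while requesting metadata from brokers for my-topic-3
    }
}
```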
@MayureshGharat, unfortunately this branch has conflicts and needs to be rebased against master. I also left a minor comment on the PR itself. If you address these, we can try to get it merged by a committer.
Force-pushed from df33131 to edc749d
@gwenshap Can you please merge this?
Done. Sorry for the delay :)
NP. Thanks :) -Mayuresh
Added topic-partition information to the exception message on batch expiry in RecordAccumulator