KAFKA-3111: Fix ConsumerPerformance reporting to use time-based instead of message-based intervals#788
KAFKA-3111: Fix ConsumerPerformance reporting to use time-based instead of message-based intervals#788vahidhashemian wants to merge 1 commit intoapache:trunkfrom
Conversation
|
Thanks for the patch. Thinking a bit more. It seems the main issue is that the reporting interval is based on messages, instead of time. Reporting the stats at some fixed time interval is more intuitive and is also used in org.apache.kafka.tools.ProducerPerformance. So, to be consistent, we probably should just change the reporting interval to time. |
|
Thanks for the feedback. If I understand correctly your suggestion is to change the implementation of the argument "reporting-interval" so instead of number of messages in each interval it takes the (perhaps minimum) length of intervals in milliseconds. It seems that this change could impact both kafka.tools.ConsumerPerformance and kafka.tools.ProducerPerformance as they both use kafka.tools.PerfConfig to extract the args. Do we want kafka.tools.ProducerPerformance modified to time-based intervals too? |
34c54e9 to
e16254b
Compare
|
I made some minor changes so the reporting interval for ConsumerPerformance becomes time-based. Although kafka.tools.ProducerPerformance indirectly uses PerfConfig it does not implement similar interval reporting and will be unchanged. |
e16254b to
91aa54f
Compare
There was a problem hiding this comment.
We don't need to change the method printProgressMessage() if we require currentTimeMillis - lastReportTime >= config.reportingInterval, right?
There was a problem hiding this comment.
Could you please clarify the question? This method is called only when currentTimeMillis - lastReportTime >= config.reportingInterval (and we want a detailed report).
There was a problem hiding this comment.
Thanks for the reply.
Here you changed the code from
1000.0 * (mbRead / elapsedMs), messagesRead, ((messagesRead - lastMessagesRead) / elapsedMs) * 1000.0))
to
val mbReadPerSec =
if ((endMs == startMs) && (mbRead == 0.0))
0.0
else
1000.0 * (mbRead / elapsedMs)
val nMsgPerSec =
if ((endMs == startMs) && (messagesRead == lastMessagesRead))
0.0
else
((messagesRead - lastMessagesRead) / elapsedMs) * 1000.0
The reason is because we want to handle the case where endMs == startMs. But since we will only call this method when currentTimeMillis - lastReportTime >= config.reportingInterval is true, then it is guaranteed that endMs > startMs, thus making the code change unnecessary, right?
There was a problem hiding this comment.
Besides, printProgressMessage is also called at line 144. Should that logic be changed as well so that it is called at given time interval?
There was a problem hiding this comment.
Thank you for your feedback again.
To answer your first comment, There doesn't seem to be any constraint on the value of reporting-interval. I can simply pass in the value of 0 (or a negative value) and no validation is performed. So the added checks might prove useful. Unless we validate the value given for reporting-interval. I hope I didn't misunderstand your point.
On your second comment, you are right. I'll update the other usage as well.
There was a problem hiding this comment.
@vahidhashemian I agree with you that currently reporting-interval may be 0 or even negative. But does such value even make sense for reporting-interval? In my opinion, we should either throw exception, or disable printMessage until the end of experiment, if reporting-interval <=0. What do you think?
There was a problem hiding this comment.
@lindong28 Throwing exceptions sounds good. I don't see any validation for other argument values either. Not sure why. For example, the messages argument could be validated too (to be >= 0).
There was a problem hiding this comment.
@vahidhashemian This code is written in late 2011 and developers at that time probably doesn't spend much time on such issue in tools and instead focus on implementing new features, since they are probably the only users of such tools. Recent committers are enforcing much high standards and they will probably ask such questions when reviewing your code. I myself have been forced to make my patch bullet-proof to any question like this before submitting it for review.
|
@lindong28 Thank you for your feedback. I responded to your comments inline. |
5230a55 to
4c7d8dc
Compare
|
@lindong28 I updated the PR based on the feedback and the discussion. |
|
Thanks @vahidhashemian, LGTM. @ijuma Do you have time to review this patch? |
There was a problem hiding this comment.
Could we avoid the System.currentTimeMillis() call in line 129?
There was a problem hiding this comment.
Sure and a nice catch. One option is to make currentTimeMillis a variable and initially assign it before the while loop at line 129, and then re-assign the current time to it inside the while loop (to replace the current val definition at line 140). I'll update the PR.
477d8d9 to
e5fab1e
Compare
|
@junrao Thank you for your feedback. I updated the PR based on your suggestions. |
There was a problem hiding this comment.
For reporting purpose, I think it's enough to call System.currentTimeMillis once per while loop, instead of once per record. Also, the patch changed the behavior of the tool a bit. Currently, if there are no new messages, the tool will exit after 1 sec. With the patch, the tool will block forever. We will need to revert to the previous behavior.
Finally, since our system tests use this tool. Could you run our system tests (see test/README.md) and make sure that the patch doesn't break any test? If needed, @granders can give you access to our jenkins job so that you can test your branch on EC2.
There was a problem hiding this comment.
@junrao Thanks again for your feedback. To respond to your comments:
- Sure, I'll update that for the new consumer. It seems to me the old consumer is OK.
- Right, I see it happening for the new consumer only, and it's because we don't refresh the value of
currentTimeMillisvariable in case theforloop doesn't run. Actually, the fix for the issue (1) above will fix this one too. - I'll try to run the system tests locally, but my resources are limited. In any case, @granders would you please give me access so I can run these tests on EC2? Thanks.
9105d09 to
6a890c1
Compare
|
I made the changes discussed. Just waiting for EC2 access so I can run system tests. |
|
@vahidhashemian I already triggered a run of system tests on your branch here: |
|
@granders Thank you. Appreciate the quick help. |
|
No problem @vahidhashemian ! |
|
@junrao Looks like the system tests ran without an issue. They failed the first time on one test perhaps due to a transient failure. |
|
@junrao It's been a while since the PR was updated and passed system tests. Do you see anything further that needs to be addressed? Thanks. |
There was a problem hiding this comment.
Could we move this line to before line 132? That will save a System.currentTimeMillis call in 132.
There was a problem hiding this comment.
Sure, I hope I got it right in the updated patch.
|
@vahidhashemian : Sorry for the delay. Looks good. Just had a minor comment. |
…ad of message-based intervals
6a890c1 to
a5b56f1
Compare
|
@junrao Thanks for taking another look. I updated the patch as you suggested. Do you think another system test is required? |
|
Thanks for the patch. LGTM |
|
Hi, I am using kafka 2.11_1.1.0 and still facing this issue. My understanding is this patch is available in kafka 1.1.0 version. Please let me know if still I still need to pull. Later I got it :: |
|
@Meghajnct, this PR has been in the past few releases; but at least from the stack trace you shared, it doesn't seem the issue you're experiencing is related to this PR. I'd suggest you open a JIRA ticket and explain the issue in as much details as possible. Thanks! |
Interval lengths for ConsumerPerformance could sometime be calculated as zero. In such cases, when the bytes read or messages read are also zero a NaN output is returned for mbRead per second or for nMsg per second, whereas zero would be a more appropriate output.
In cases where interval length is zero but there have been data and messages to read, an output of Infinity is returned, as expected.