KAFKA-14953: Add tiered storage related metrics#13944
Conversation
|
@abhijeetk88 , please let me know when PR is ready for review. Thanks for the work on weekend! :) |
|
@showuon The PR is ready for review. Please take a look. |
|
Thanks @abhijeetk88 , will review it later this week. |
divijvaidya
left a comment
There was a problem hiding this comment.
Thank you for making this change! I left a couple of comments. Let me know what you think about them.
There was a problem hiding this comment.
We should probably add Remote metrics only when RemoteStorage is enabled on the cluster. Otherwise these these per-topic metrics are useless.
There was a problem hiding this comment.
This requires a significant amount of changes. We need to know in this class if remote storage is enabled and for that to happen BrokerTopicStats needs to know that. There are many instances of object creation for BrokerTopicStats all of which will need to supply whether remote storage is enabled.
Should we address this in a separate PR? Let me know if you have better ideas to do this.
There was a problem hiding this comment.
Fine. Please open a JIRA ticket for this improvement. Thanks.
There was a problem hiding this comment.
I wasn't able to find the JIRA, so when ahead created on https://issues.apache.org/jira/browse/KAFKA-15189 . This JIRA is a blocker for 3.6 launch otherwise it will leak to regression for customers not using the Remote Storage feature,
There was a problem hiding this comment.
@abhijeetk88 , thanks for the PR. Made a pass, overall LGTM!
But I didn't see the tests for RemoteLogReaderTaskQueueSize, RemoteLogReaderAvgIdlePercent, and RemoteLogManagerTasksAvgIdlePercent. Could you help add them? Thanks.
|
@showuon adding tests for those gauge metrics requires significant changes. We will need to obtain a handle for the underlying threadpool, submit tasks to the threadpool, make sure they don't progress, and then verify the gauges for non-zero values. Shall we add those tests in a follow-up PR? Let me know if you have better ideas for these tests. |
|
@abhijeetk88 @showuon Let us have a followup JIRA/PR on adding more test for the metrics. |
Sure, please open a JIRA ticket to track it. Thanks. |
| BrokerTopicStats.RemoteReadRequestsPerSec -> MeterWrapper(BrokerTopicStats.RemoteReadRequestsPerSec, "requests"), | ||
| BrokerTopicStats.RemoteWriteRequestsPerSec -> MeterWrapper(BrokerTopicStats.RemoteWriteRequestsPerSec, "requests"), | ||
| BrokerTopicStats.FailedRemoteReadRequestsPerSec -> MeterWrapper(BrokerTopicStats.FailedRemoteReadRequestsPerSec, "requests"), | ||
| BrokerTopicStats.FailedRemoteWriteRequestsPerSec -> MeterWrapper(BrokerTopicStats.FailedRemoteWriteRequestsPerSec, "requests"), |
There was a problem hiding this comment.
few more naming suggestions:
| BrokerTopicStats.RemoteReadRequestsPerSec -> MeterWrapper(BrokerTopicStats.RemoteReadRequestsPerSec, "requests"), | |
| BrokerTopicStats.RemoteWriteRequestsPerSec -> MeterWrapper(BrokerTopicStats.RemoteWriteRequestsPerSec, "requests"), | |
| BrokerTopicStats.FailedRemoteReadRequestsPerSec -> MeterWrapper(BrokerTopicStats.FailedRemoteReadRequestsPerSec, "requests"), | |
| BrokerTopicStats.FailedRemoteWriteRequestsPerSec -> MeterWrapper(BrokerTopicStats.FailedRemoteWriteRequestsPerSec, "requests"), | |
| BrokerTopicStats.RemoteFetchRequestsPerSec -> MeterWrapper(BrokerTopicStats.RemoteFetchRequestsPerSec, "requests"), | |
| BrokerTopicStats.RemoteCopyRequestsPerSec -> MeterWrapper(BrokerTopicStats.RemoteCopyRequestsPerSec, "requests"), | |
| BrokerTopicStats.FailedRemoteCopyRequestsPerSec -> MeterWrapper(BrokerTopicStats.FailedRemoteFetchRequestsPerSec, "requests"), | |
| BrokerTopicStats.FailedRemoteCopyRequestsPerSec -> MeterWrapper(BrokerTopicStats.FailedRemoteCopyRequestsPerSec, "requests"), |
There was a problem hiding this comment.
Note that the KIP version was voted with the earlier names and now changing the KIP would be equivalent to breaking the agreed contract when the KIP was accepted. I will propose to keep the metric names as they were in the KIP. If we want to introduce new ones, we can start a new KIP which deprecates these metrics and uses new ones.
There was a problem hiding this comment.
I was suspecting this would be the case. Though as metrics haven't been implemented yet, would it be possible to amend the KIP and ask for feedback?
There was a problem hiding this comment.
We at (Amazon MSK) have an implementation against the KIP-405 interfaces [1] and we explicitly tell the customers that the metric such as RemoteBytesInPerSec are as per KIP-405 contract. If we change the KIP now, it would be a break in contract for our customers. Hence, I would not be in favour of amending the accepted KIP at this time.
[1] see: RemoteBytesInPerSec at https://docs.aws.amazon.com/msk/latest/developerguide/metrics-details.html
There was a problem hiding this comment.
We should avoid having multiple versions of metrics here. As it is not yet released in Apache, we should have better naming agreeable for the majority. We do not need to have two versions here when it is released in Apache. It avoids confusion for users too. We can discuss whether that can be done by updating KIP-405(I guess I did update d a couple of them) or we can have a minor KIP for them.
@divijvaidya This small change can be handled by organizations that adopted KIP-405.
KAFKA-15176 is opened. |
showuon
left a comment
There was a problem hiding this comment.
LGTM! @abhijeetk88 , could you resolve the conflict?
@divijvaidya @jeqo , any other comments to this PR?
satishd
left a comment
There was a problem hiding this comment.
Thanks @abhijeetk88 for the PR. Overall LGTM. But we can have better naming for these metrics. That can be done in a followup PR once we decide on the names.
jeqo
left a comment
There was a problem hiding this comment.
LGTM, thanks @abhijeetk88 !
divijvaidya
left a comment
There was a problem hiding this comment.
Thank you for you patience so far Abhijeet. I have one last round of questions/comments. I promise to provide rapid review replies (on weekdays) to wrap up this PR soon.
| Utils.closeQuietly(indexCache, "RemoteIndexCache"); | ||
|
|
||
| rlmScheduledThreadPool.close(); | ||
| removeMetrics(); |
There was a problem hiding this comment.
This should probably be done in a try/finally after the thread pool (in the next line) has been shutdown.
| KafkaMetricsGroup mockThreadPoolMetricsGroup = mockMetricsGroupCtor.constructed().get(1); | ||
|
|
||
| List<String> remoteLogManagerMetricNames = Collections.singletonList(REMOTE_LOG_MANAGER_TASKS_AVG_IDLE_PERCENT); | ||
| List<String> remoteStorageThreadPoolMetricNames = Arrays.asList( |
There was a problem hiding this comment.
When we add a metric to the RemoteLogManager or to the thread pool, we will have to update this test. Alternatively, may I suggest, the pattern we have used at other places, i.e. create a list of metrics in the class where metric group is present, and verify invocations in the test for all members of that class. With this pattern, you won't have to modify the test at all when adding new metrics.
As an example, you can see
Separately, I also like the pattern of metric decoupling introduced used by QuorumControllerMetrics. It neatly encapsulates all QuorumController metrics at one place and we can potentially do similar for RemoteLogManager. This is a suggestion and feel free to ignore this.
| BrokerTopicStats.TotalFetchRequestsPerSec -> MeterWrapper(BrokerTopicStats.TotalFetchRequestsPerSec, "requests"), | ||
| BrokerTopicStats.FetchMessageConversionsPerSec -> MeterWrapper(BrokerTopicStats.FetchMessageConversionsPerSec, "requests"), | ||
| BrokerTopicStats.ProduceMessageConversionsPerSec -> MeterWrapper(BrokerTopicStats.ProduceMessageConversionsPerSec, "requests"), | ||
| BrokerTopicStats.RemoteCopyBytesPerSec -> MeterWrapper(BrokerTopicStats.RemoteCopyBytesPerSec, "bytes"), |
There was a problem hiding this comment.
Please note that for some other metrics, we store aggregated topics stat using allTopicsStats. Are we intentionally not add remote* metrics to it?
There was a problem hiding this comment.
I have added those now.
|
|
||
| } | ||
|
|
||
| public void removeMetrics() { |
There was a problem hiding this comment.
Hi, I am planning to clean up all the leaked metrics in the current base code recently, so I will pay close attention to the "metric", Is it possible to follow this template here? It is better to put the metric name in a collection to prevent others from forgetting to remove when adding a new metric. Otherwise, my work in metric cleaning this area may be endless. @abhijeetk88
|
Addressed all the comments. Please take a look. |
|
@abhijeetk88 , there are errors from spotbugs. Could you take a look? |
divijvaidya
left a comment
There was a problem hiding this comment.
Thank you for your patience through this PR @abhijeetk88 (and congratulations of your first contribution to Apache Kafka!)
The changes look good to me. I am waiting for CI to reach an acceptable state now.
kamalcph
left a comment
There was a problem hiding this comment.
LGTM. Thanks for the PR!
There was a problem hiding this comment.
Hi, came across this PR and have a qq, why we don't include the OffsetOutOfRangeException result into the failed remote read request rate metrics? If for some reason we can not, can we have a separate metrics to capture that error in metrics?
There was a problem hiding this comment.
We do not want to include OffsetOutOfRangeException because it is not really a failure in remote reads.
We can have a separate discussion on a different metric since that change is unrelated to this PR.
There was a problem hiding this comment.
Thank you. I will create a JIRA to track that discussion.
There was a problem hiding this comment.
|
This has 33 failures, we are currently have 8-9 flaky test failures in other PRs on trunk. i am re-running the tests. |
|
After the jenkins rebuild again, there are ~7 test failures that are unrelated to this PR. Merging it to trunk. |
* ak/trunk: (110 commits) MINOR: Update docs to include ZK deprecation notice and information (apache#14031) KAFKA-15091: Fix misleading Javadoc for SourceTask::commit (apache#13948) KAFKA-14669: Use the generated docs for MirrorMaker configs in the doc (apache#13658) KAFKA-14953: Add tiered storage related metrics (apache#13944) KAFKA-15121: Implement the alterOffsets method in the FileStreamSourceConnector and the FileStreamSinkConnector (apache#13945) Revert "MINOR: Update .asf.yaml file with refreshed github_whitelist, and collaborators" (apache#14037) MINOR: Update .asf.yaml file with refreshed github_whitelist, and collaborators KAFKA-14737: Move kafka.utils.json to server-common (apache#13585) KAFKA-14647: Move TopicFilter to server-common/utils (apache#13158) MINOR: remove unused variable in examples (apache#14021) ...
* KAFKA-14953: Adding RemoteLogManager metrics In this PR, I have added the following metrics that are related to tiered storage mentioned in[ KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage). |Metric|Description| |-----------------------------------------|--------------------------------------------------------------| | RemoteReadRequestsPerSec | Number of remote storage read requests per second | | RemoteWriteRequestsPerSec | Number of remote storage write requests per second | | RemoteBytesInPerSec | Number of bytes read from remote storage per second | | RemoteReadErrorsPerSec | Number of remote storage read errors per second | | RemoteBytesOutPerSec | Number of bytes copied to remote storage per second | | RemoteWriteErrorsPerSec | Number of remote storage write errors per second | | RemoteLogReaderTaskQueueSize | Number of remote storage read tasks pending for execution. | | RemoteLogReaderAvgIdlePercent | Average idle percent of the remote storage reader thread pool| | RemoteLogManagerTasksAvgIdlePercent | Average idle percent of RemoteLogManager thread pool | Added unit tests for all the rate metrics. Reviewers: Luke Chen <showuon@gmail.com>, Divij Vaidya <diviv@amazon.com>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>, Jorge Esteban Quilcate Otoya <quilcate.jorge@gmail.com>, Staniel Yao<yaolixinylx@gmail.com>, hudeqi<1217150961@qq.com>, Satish Duggana <satishd@apache.org>
* KAFKA-14953: Adding RemoteLogManager metrics In this PR, I have added the following metrics that are related to tiered storage mentioned in[ KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage). |Metric|Description| |-----------------------------------------|--------------------------------------------------------------| | RemoteReadRequestsPerSec | Number of remote storage read requests per second | | RemoteWriteRequestsPerSec | Number of remote storage write requests per second | | RemoteBytesInPerSec | Number of bytes read from remote storage per second | | RemoteReadErrorsPerSec | Number of remote storage read errors per second | | RemoteBytesOutPerSec | Number of bytes copied to remote storage per second | | RemoteWriteErrorsPerSec | Number of remote storage write errors per second | | RemoteLogReaderTaskQueueSize | Number of remote storage read tasks pending for execution. | | RemoteLogReaderAvgIdlePercent | Average idle percent of the remote storage reader thread pool| | RemoteLogManagerTasksAvgIdlePercent | Average idle percent of RemoteLogManager thread pool | Added unit tests for all the rate metrics. Reviewers: Luke Chen <showuon@gmail.com>, Divij Vaidya <diviv@amazon.com>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>, Jorge Esteban Quilcate Otoya <quilcate.jorge@gmail.com>, Staniel Yao<yaolixinylx@gmail.com>, hudeqi<1217150961@qq.com>, Satish Duggana <satishd@apache.org>
* KAFKA-14953: Adding RemoteLogManager metrics In this PR, I have added the following metrics that are related to tiered storage mentioned in[ KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage). |Metric|Description| |-----------------------------------------|--------------------------------------------------------------| | RemoteReadRequestsPerSec | Number of remote storage read requests per second | | RemoteWriteRequestsPerSec | Number of remote storage write requests per second | | RemoteBytesInPerSec | Number of bytes read from remote storage per second | | RemoteReadErrorsPerSec | Number of remote storage read errors per second | | RemoteBytesOutPerSec | Number of bytes copied to remote storage per second | | RemoteWriteErrorsPerSec | Number of remote storage write errors per second | | RemoteLogReaderTaskQueueSize | Number of remote storage read tasks pending for execution. | | RemoteLogReaderAvgIdlePercent | Average idle percent of the remote storage reader thread pool| | RemoteLogManagerTasksAvgIdlePercent | Average idle percent of RemoteLogManager thread pool | Added unit tests for all the rate metrics. Reviewers: Luke Chen <showuon@gmail.com>, Divij Vaidya <diviv@amazon.com>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>, Jorge Esteban Quilcate Otoya <quilcate.jorge@gmail.com>, Staniel Yao<yaolixinylx@gmail.com>, hudeqi<1217150961@qq.com>, Satish Duggana <satishd@apache.org>
In this PR, I have added the following metrics that are related to tiered storage mentioned in KIP-405.
Added unit tests for all the rate metrics.
I will have a follow-up PR to add these metrics to Kafka documentation.
Committer Checklist (excluded from commit message)