Skip to content

KAFKA-14953: Add tiered storage related metrics#13944

Merged
satishd merged 9 commits intoapache:trunkfrom
abhijeetk88:KAFKA-14953
Jul 18, 2023
Merged

KAFKA-14953: Add tiered storage related metrics#13944
satishd merged 9 commits intoapache:trunkfrom
abhijeetk88:KAFKA-14953

Conversation

@abhijeetk88
Copy link
Copy Markdown
Contributor

@abhijeetk88 abhijeetk88 commented Jul 1, 2023

In this PR, I have added the following metrics that are related to tiered storage mentioned in KIP-405.

Metric Description
RemoteReadRequestsPerSec Number of remote storage read requests per second
RemoteWriteRequestsPerSec Number of remote storage write requests per second
RemoteBytesInPerSec Number of bytes read from remote storage per second
RemoteReadErrorsPerSec Number of remote storage read errors per second
RemoteBytesOutPerSec Number of bytes copied to remote storage per second
RemoteWriteErrorsPerSec Number of remote storage write errors per second
RemoteLogReaderTaskQueueSize Number of remote storage read tasks pending for execution
RemoteLogReaderAvgIdlePercent Average idle percent of the remote storage reader thread pool
RemoteLogManagerTasksAvgIdlePercent Average idle percent of RemoteLogManager thread pool

Added unit tests for all the rate metrics.

I will have a follow-up PR to add these metrics to Kafka documentation.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@abhijeetk88 abhijeetk88 changed the title [DRAFT PR - WIP] KAFKA-14953: Adding RemoteLogManager metrics [DRAFT] KAFKA-14953: Adding RemoteLogManager metrics Jul 1, 2023
@abhijeetk88 abhijeetk88 changed the title [DRAFT] KAFKA-14953: Adding RemoteLogManager metrics [DRAFT] KAFKA-14953: Add tiered storage related metrics Jul 1, 2023
@abhijeetk88 abhijeetk88 marked this pull request as draft July 1, 2023 05:36
@showuon
Copy link
Copy Markdown
Member

showuon commented Jul 1, 2023

@abhijeetk88 , please let me know when PR is ready for review. Thanks for the work on weekend! :)

@abhijeetk88 abhijeetk88 changed the title [DRAFT] KAFKA-14953: Add tiered storage related metrics KAFKA-14953: Add tiered storage related metrics Jul 3, 2023
@abhijeetk88 abhijeetk88 marked this pull request as ready for review July 3, 2023 11:16
@satishd satishd self-requested a review July 3, 2023 11:18
@abhijeetk88
Copy link
Copy Markdown
Contributor Author

@showuon The PR is ready for review. Please take a look.

@showuon
Copy link
Copy Markdown
Member

showuon commented Jul 3, 2023

Thanks @abhijeetk88 , will review it later this week.

@showuon showuon added the tiered-storage Related to the Tiered Storage feature label Jul 3, 2023
Copy link
Copy Markdown
Member

@divijvaidya divijvaidya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making this change! I left a couple of comments. Let me know what you think about them.

Comment thread core/src/main/java/kafka/log/remote/RemoteLogManager.java Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add Remote metrics only when RemoteStorage is enabled on the cluster. Otherwise these these per-topic metrics are useless.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires a significant amount of changes. We need to know in this class if remote storage is enabled and for that to happen BrokerTopicStats needs to know that. There are many instances of object creation for BrokerTopicStats all of which will need to supply whether remote storage is enabled.

Should we address this in a separate PR? Let me know if you have better ideas to do this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine. Please open a JIRA ticket for this improvement. Thanks.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't able to find the JIRA, so when ahead created on https://issues.apache.org/jira/browse/KAFKA-15189 . This JIRA is a blocker for 3.6 launch otherwise it will leak to regression for customers not using the Remote Storage feature,

Comment thread core/src/test/java/kafka/log/remote/RemoteLogReaderTest.java Outdated
Comment thread core/src/main/java/kafka/log/remote/RemoteLogManager.java Outdated
Comment thread core/src/test/java/kafka/log/remote/RemoteLogReaderTest.java Outdated
Copy link
Copy Markdown
Member

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abhijeetk88 , thanks for the PR. Made a pass, overall LGTM!
But I didn't see the tests for RemoteLogReaderTaskQueueSize, RemoteLogReaderAvgIdlePercent, and RemoteLogManagerTasksAvgIdlePercent. Could you help add them? Thanks.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point!

@abhijeetk88
Copy link
Copy Markdown
Contributor Author

@showuon adding tests for those gauge metrics requires significant changes. We will need to obtain a handle for the underlying threadpool, submit tasks to the threadpool, make sure they don't progress, and then verify the gauges for non-zero values.

Shall we add those tests in a follow-up PR? Let me know if you have better ideas for these tests.

@satishd
Copy link
Copy Markdown
Member

satishd commented Jul 10, 2023

@abhijeetk88 @showuon Let us have a followup JIRA/PR on adding more test for the metrics.

@showuon
Copy link
Copy Markdown
Member

showuon commented Jul 10, 2023

@abhijeetk88 @showuon Let us have a followup JIRA/PR on adding more test for the metrics.

Sure, please open a JIRA ticket to track it. Thanks.

Comment on lines +282 to +285
BrokerTopicStats.RemoteReadRequestsPerSec -> MeterWrapper(BrokerTopicStats.RemoteReadRequestsPerSec, "requests"),
BrokerTopicStats.RemoteWriteRequestsPerSec -> MeterWrapper(BrokerTopicStats.RemoteWriteRequestsPerSec, "requests"),
BrokerTopicStats.FailedRemoteReadRequestsPerSec -> MeterWrapper(BrokerTopicStats.FailedRemoteReadRequestsPerSec, "requests"),
BrokerTopicStats.FailedRemoteWriteRequestsPerSec -> MeterWrapper(BrokerTopicStats.FailedRemoteWriteRequestsPerSec, "requests"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few more naming suggestions:

Suggested change
BrokerTopicStats.RemoteReadRequestsPerSec -> MeterWrapper(BrokerTopicStats.RemoteReadRequestsPerSec, "requests"),
BrokerTopicStats.RemoteWriteRequestsPerSec -> MeterWrapper(BrokerTopicStats.RemoteWriteRequestsPerSec, "requests"),
BrokerTopicStats.FailedRemoteReadRequestsPerSec -> MeterWrapper(BrokerTopicStats.FailedRemoteReadRequestsPerSec, "requests"),
BrokerTopicStats.FailedRemoteWriteRequestsPerSec -> MeterWrapper(BrokerTopicStats.FailedRemoteWriteRequestsPerSec, "requests"),
BrokerTopicStats.RemoteFetchRequestsPerSec -> MeterWrapper(BrokerTopicStats.RemoteFetchRequestsPerSec, "requests"),
BrokerTopicStats.RemoteCopyRequestsPerSec -> MeterWrapper(BrokerTopicStats.RemoteCopyRequestsPerSec, "requests"),
BrokerTopicStats.FailedRemoteCopyRequestsPerSec -> MeterWrapper(BrokerTopicStats.FailedRemoteFetchRequestsPerSec, "requests"),
BrokerTopicStats.FailedRemoteCopyRequestsPerSec -> MeterWrapper(BrokerTopicStats.FailedRemoteCopyRequestsPerSec, "requests"),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the KIP version was voted with the earlier names and now changing the KIP would be equivalent to breaking the agreed contract when the KIP was accepted. I will propose to keep the metric names as they were in the KIP. If we want to introduce new ones, we can start a new KIP which deprecates these metrics and uses new ones.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was suspecting this would be the case. Though as metrics haven't been implemented yet, would it be possible to amend the KIP and ask for feedback?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We at (Amazon MSK) have an implementation against the KIP-405 interfaces [1] and we explicitly tell the customers that the metric such as RemoteBytesInPerSec are as per KIP-405 contract. If we change the KIP now, it would be a break in contract for our customers. Hence, I would not be in favour of amending the accepted KIP at this time.

[1] see: RemoteBytesInPerSec at https://docs.aws.amazon.com/msk/latest/developerguide/metrics-details.html

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid having multiple versions of metrics here. As it is not yet released in Apache, we should have better naming agreeable for the majority. We do not need to have two versions here when it is released in Apache. It avoids confusion for users too. We can discuss whether that can be done by updating KIP-405(I guess I did update d a couple of them) or we can have a minor KIP for them.

@divijvaidya This small change can be handled by organizations that adopted KIP-405.

@showuon
Copy link
Copy Markdown
Member

showuon commented Jul 11, 2023

@abhijeetk88 @showuon Let us have a followup JIRA/PR on adding more test for the metrics.

Sure, please open a JIRA ticket to track it. Thanks.

KAFKA-15176 is opened.

Copy link
Copy Markdown
Member

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! @abhijeetk88 , could you resolve the conflict?
@divijvaidya @jeqo , any other comments to this PR?

Copy link
Copy Markdown
Member

@satishd satishd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @abhijeetk88 for the PR. Overall LGTM. But we can have better naming for these metrics. That can be done in a followup PR once we decide on the names.

Copy link
Copy Markdown
Contributor

@jeqo jeqo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @abhijeetk88 !

Copy link
Copy Markdown
Member

@divijvaidya divijvaidya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for you patience so far Abhijeet. I have one last round of questions/comments. I promise to provide rapid review replies (on weekdays) to wrap up this PR soon.

Utils.closeQuietly(indexCache, "RemoteIndexCache");

rlmScheduledThreadPool.close();
removeMetrics();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be done in a try/finally after the thread pool (in the next line) has been shutdown.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

KafkaMetricsGroup mockThreadPoolMetricsGroup = mockMetricsGroupCtor.constructed().get(1);

List<String> remoteLogManagerMetricNames = Collections.singletonList(REMOTE_LOG_MANAGER_TASKS_AVG_IDLE_PERCENT);
List<String> remoteStorageThreadPoolMetricNames = Arrays.asList(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we add a metric to the RemoteLogManager or to the thread pool, we will have to update this test. Alternatively, may I suggest, the pattern we have used at other places, i.e. create a list of metrics in the class where metric group is present, and verify invocations in the test for all members of that class. With this pattern, you won't have to modify the test at all when adding new metrics.

As an example, you can see

private[log] val MetricNames = Set(

Separately, I also like the pattern of metric decoupling introduced used by QuorumControllerMetrics. It neatly encapsulates all QuorumController metrics at one place and we can potentially do similar for RemoteLogManager. This is a suggestion and feel free to ignore this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

Comment thread core/src/main/java/kafka/log/remote/RemoteLogManager.java Outdated
BrokerTopicStats.TotalFetchRequestsPerSec -> MeterWrapper(BrokerTopicStats.TotalFetchRequestsPerSec, "requests"),
BrokerTopicStats.FetchMessageConversionsPerSec -> MeterWrapper(BrokerTopicStats.FetchMessageConversionsPerSec, "requests"),
BrokerTopicStats.ProduceMessageConversionsPerSec -> MeterWrapper(BrokerTopicStats.ProduceMessageConversionsPerSec, "requests"),
BrokerTopicStats.RemoteCopyBytesPerSec -> MeterWrapper(BrokerTopicStats.RemoteCopyBytesPerSec, "bytes"),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that for some other metrics, we store aggregated topics stat using allTopicsStats. Are we intentionally not add remote* metrics to it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added those now.


}

public void removeMetrics() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I am planning to clean up all the leaked metrics in the current base code recently, so I will pay close attention to the "metric", Is it possible to follow this template here? It is better to put the metric name in a collection to prevent others from forgetting to remove when adding a new metric. Otherwise, my work in metric cleaning this area may be endless. @abhijeetk88

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

@abhijeetk88
Copy link
Copy Markdown
Contributor Author

Addressed all the comments. Please take a look.

@showuon
Copy link
Copy Markdown
Member

showuon commented Jul 17, 2023

@abhijeetk88 , there are errors from spotbugs. Could you take a look?
https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-13944/7/pipeline

Copy link
Copy Markdown
Member

@divijvaidya divijvaidya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your patience through this PR @abhijeetk88 (and congratulations of your first contribution to Apache Kafka!)

The changes look good to me. I am waiting for CI to reach an acceptable state now.

Copy link
Copy Markdown
Contributor

@kamalcph kamalcph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the PR!

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, came across this PR and have a qq, why we don't include the OffsetOutOfRangeException result into the failed remote read request rate metrics? If for some reason we can not, can we have a separate metrics to capture that error in metrics?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not want to include OffsetOutOfRangeException because it is not really a failure in remote reads.

We can have a separate discussion on a different metric since that change is unrelated to this PR.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I will create a JIRA to track that discussion.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@divijvaidya
Copy link
Copy Markdown
Member

This has 33 failures, we are currently have 8-9 flaky test failures in other PRs on trunk. i am re-running the tests.

@satishd
Copy link
Copy Markdown
Member

satishd commented Jul 18, 2023

After the jenkins rebuild again, there are ~7 test failures that are unrelated to this PR. Merging it to trunk.

@satishd satishd merged commit fd3b113 into apache:trunk Jul 18, 2023
jolshan added a commit to confluentinc/kafka that referenced this pull request Jul 18, 2023
* ak/trunk: (110 commits)
  MINOR: Update docs to include ZK deprecation notice and information (apache#14031)
  KAFKA-15091: Fix misleading Javadoc for SourceTask::commit (apache#13948)
  KAFKA-14669: Use the generated docs for MirrorMaker configs in the doc (apache#13658)
  KAFKA-14953: Add tiered storage related metrics (apache#13944)
  KAFKA-15121: Implement the alterOffsets method in the FileStreamSourceConnector and the FileStreamSinkConnector (apache#13945)
  Revert "MINOR: Update .asf.yaml file with refreshed github_whitelist, and collaborators" (apache#14037)
  MINOR: Update .asf.yaml file with refreshed github_whitelist, and collaborators
  KAFKA-14737: Move kafka.utils.json to server-common (apache#13585)
  KAFKA-14647: Move TopicFilter to server-common/utils (apache#13158)
  MINOR: remove unused variable in examples (apache#14021)
  ...
jeqo pushed a commit to jeqo/kafka that referenced this pull request Jul 21, 2023
* KAFKA-14953: Adding RemoteLogManager metrics
In this PR, I have added the following metrics that are related to tiered storage mentioned in[ KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage).
|Metric|Description|
|-----------------------------------------|--------------------------------------------------------------|
| RemoteReadRequestsPerSec                    | Number of remote storage read requests per second               |
| RemoteWriteRequestsPerSec                    | Number of remote storage write requests per second              |
| RemoteBytesInPerSec                                | Number of bytes read from remote storage per second           |
| RemoteReadErrorsPerSec                          | Number of remote storage read errors per second                   |
| RemoteBytesOutPerSec                             | Number of bytes copied to remote storage per second            |
| RemoteWriteErrorsPerSec                          | Number of remote storage write errors per second                   |
| RemoteLogReaderTaskQueueSize             | Number of remote storage read tasks pending for execution.  |
| RemoteLogReaderAvgIdlePercent             | Average idle percent of the remote storage reader thread pool|
| RemoteLogManagerTasksAvgIdlePercent | Average idle percent of RemoteLogManager thread pool          |

Added unit tests for all the rate metrics.

Reviewers: Luke Chen <showuon@gmail.com>, Divij Vaidya <diviv@amazon.com>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>, Jorge Esteban Quilcate Otoya <quilcate.jorge@gmail.com>, Staniel Yao<yaolixinylx@gmail.com>, hudeqi<1217150961@qq.com>, Satish Duggana <satishd@apache.org>
Cerchie pushed a commit to Cerchie/kafka that referenced this pull request Jul 25, 2023
* KAFKA-14953: Adding RemoteLogManager metrics
In this PR, I have added the following metrics that are related to tiered storage mentioned in[ KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage).
|Metric|Description|
|-----------------------------------------|--------------------------------------------------------------|
| RemoteReadRequestsPerSec                    | Number of remote storage read requests per second               |
| RemoteWriteRequestsPerSec                    | Number of remote storage write requests per second              |
| RemoteBytesInPerSec                                | Number of bytes read from remote storage per second           |
| RemoteReadErrorsPerSec                          | Number of remote storage read errors per second                   |
| RemoteBytesOutPerSec                             | Number of bytes copied to remote storage per second            |
| RemoteWriteErrorsPerSec                          | Number of remote storage write errors per second                   |
| RemoteLogReaderTaskQueueSize             | Number of remote storage read tasks pending for execution.  |
| RemoteLogReaderAvgIdlePercent             | Average idle percent of the remote storage reader thread pool|
| RemoteLogManagerTasksAvgIdlePercent | Average idle percent of RemoteLogManager thread pool          |

Added unit tests for all the rate metrics.

Reviewers: Luke Chen <showuon@gmail.com>, Divij Vaidya <diviv@amazon.com>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>, Jorge Esteban Quilcate Otoya <quilcate.jorge@gmail.com>, Staniel Yao<yaolixinylx@gmail.com>, hudeqi<1217150961@qq.com>, Satish Duggana <satishd@apache.org>
jeqo pushed a commit to aiven/kafka that referenced this pull request Aug 15, 2023
* KAFKA-14953: Adding RemoteLogManager metrics
In this PR, I have added the following metrics that are related to tiered storage mentioned in[ KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage).
|Metric|Description|
|-----------------------------------------|--------------------------------------------------------------|
| RemoteReadRequestsPerSec                    | Number of remote storage read requests per second               |
| RemoteWriteRequestsPerSec                    | Number of remote storage write requests per second              |
| RemoteBytesInPerSec                                | Number of bytes read from remote storage per second           |
| RemoteReadErrorsPerSec                          | Number of remote storage read errors per second                   |
| RemoteBytesOutPerSec                             | Number of bytes copied to remote storage per second            |
| RemoteWriteErrorsPerSec                          | Number of remote storage write errors per second                   |
| RemoteLogReaderTaskQueueSize             | Number of remote storage read tasks pending for execution.  |
| RemoteLogReaderAvgIdlePercent             | Average idle percent of the remote storage reader thread pool|
| RemoteLogManagerTasksAvgIdlePercent | Average idle percent of RemoteLogManager thread pool          |

Added unit tests for all the rate metrics.

Reviewers: Luke Chen <showuon@gmail.com>, Divij Vaidya <diviv@amazon.com>, Kamal Chandraprakash<kamal.chandraprakash@gmail.com>, Jorge Esteban Quilcate Otoya <quilcate.jorge@gmail.com>, Staniel Yao<yaolixinylx@gmail.com>, hudeqi<1217150961@qq.com>, Satish Duggana <satishd@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tiered-storage Related to the Tiered Storage feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants