Skip to content

Conversation

@frankxieke
Copy link
Contributor

Motivation

Currently, there is no offload metrics for tiered storage, so it is very hard for us to debug the performance issues. For example , we can not find why offload is slow or why read offload is slow. For above reasons. we need to add some offload metrics for monitoring.

Modifications

Add metrics during offload procedure and read offload data procedure. Including offloadTime, offloadError, offloadRate, readLedgerLatency, writeStoreLatency, writeStoreError, readOffloadIndexLatency, readOffloadDataLatency, readOffloadRate, readOffloadError.

Documentation

need documentation.

@codelipenghui codelipenghui added this to the 2.9.0 milestone Aug 4, 2021
@codelipenghui codelipenghui added doc-required Your PR changes impact docs and you will update later. area/tieredstorage area/metrics labels Aug 4, 2021
@codelipenghui codelipenghui requested a review from zymap August 4, 2021 15:35
@Anonymitaet Anonymitaet added doc-not-needed Your PR changes do not impact docs and removed doc-not-needed Your PR changes do not impact docs labels Aug 5, 2021
@Anonymitaet
Copy link
Member

Anonymitaet commented Aug 5, 2021

@frankxieke should we add metrics explanations here https://pulsar.apache.org/docs/en/next/reference-metrics/ ?

If so, would you like to add docs accordingly? Then you can ping me to review, thanks

@Anonymitaet
Copy link
Member

Confirmed w/ @frankxieke, he will add docs and ping me review after this PR is finalized.

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@frankxieke - this will be a good addition to our metrics. Before we merge this PR, I would like to see these metrics behind a feature flag that defaults to disabled at the topic level. Perhaps we can use the config exposeManagedLedgerMetricsInPrometheus, which already exists, as the feature flag?

@michaeljmarshall
Copy link
Member

@lhotari - I know you have contributed optimizations to the way Pulsar generates Prometheus metrics in the past. Would you mind taking a look at this PR?

Copy link
Contributor

@codelipenghui codelipenghui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulsar can enable/disable the namespace/topic level metrics. so we should aggregate for different levels, if namespace/topic level disabled we only expose the broker level metrics, if namespace enabled but topic level disabled, we only expose the namespace level aggregate value.

@frankxieke frankxieke force-pushed the offload_performance_metrics branch 2 times, most recently from 09864d6 to 7a6bdba Compare August 12, 2021 06:25
@hangc0276
Copy link
Contributor

I doubt why not choose provide OpStatsLogger (pulsar-broker/src/main/java/org/apache/pulsar/broker/stats/prometheus/metrics/DataSketchesOpStatsLogger.java) to generate metrics instead of LedgerOffloaderMXBeanImpl.

An demo of using OpStatsLogger is streamnative/kop#389

@frankxieke frankxieke force-pushed the offload_performance_metrics branch 9 times, most recently from 302dbd0 to b589d95 Compare August 25, 2021 05:34
@frankxieke frankxieke force-pushed the offload_performance_metrics branch 2 times, most recently from c4090c7 to f4024ce Compare August 25, 2021 05:56
@eolivelli eolivelli modified the milestones: 2.9.0, 2.10.0 Oct 6, 2021
@frankxieke frankxieke force-pushed the offload_performance_metrics branch from f4024ce to a636fbd Compare November 2, 2021 08:10
@frankxieke
Copy link
Contributor Author

rerun failure checks

@frankxieke frankxieke force-pushed the offload_performance_metrics branch from d4c1689 to 334bc66 Compare November 30, 2021 01:24
@frankxieke
Copy link
Contributor Author

rerun failure checks

@frankxieke
Copy link
Contributor Author

rerun failure checks

rerun failure checks

@codelipenghui codelipenghui modified the milestones: 2.10.0, 2.11.0 Jan 18, 2022
@Anonymitaet Anonymitaet removed the doc-required Your PR changes impact docs and you will update later. label Jan 18, 2022
@michaeljmarshall michaeljmarshall mentioned this pull request Jan 20, 2022
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants