
Conversation

@marksilcox
Contributor

Fixes #8407
Fixes #13865

Motivation

Current broker Prometheus metrics are not grouped by metric type, which causes issues in systems that read these metrics (e.g. Datadog).

The Prometheus docs state that "All lines for a given metric must be provided as one single group" - https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md#grouping-and-sorting
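For illustration, grouped output in the text exposition format looks like this (the metric names below are made up for the example, not actual Pulsar metrics):

```
# TYPE pulsar_example_rate_in gauge
pulsar_example_rate_in{namespace="ns1"} 12.0
pulsar_example_rate_in{namespace="ns2"} 7.0
# TYPE pulsar_example_storage_size gauge
pulsar_example_storage_size{namespace="ns1"} 1024.0
```

An exporter that interleaves samples of different metrics between two `# TYPE` lines for the same name violates this rule, and strict parsers such as Datadog's can mishandle the output.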

Modifications

  • Updated the namespace and topic Prometheus metric generators to group the metrics under the appropriate TYPE header (a sketch of the approach follows this list).
  • Updated the function worker stats to include TYPE headers.
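
As a rough illustration of the grouping approach, here is a minimal sketch; it is not the actual Pulsar generator code, and `Sample` and `GroupedMetricsWriter` are made-up names. Samples are buffered per metric name and then flushed one group at a time, each under a single `# TYPE` header:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupedMetricsWriter {

    // Hypothetical sample: metric name, rendered label set, value.
    record Sample(String name, String labels, double value) {}

    // Buffer samples per metric name, then emit each group under a
    // single # TYPE header, as the exposition format requires.
    static String render(List<Sample> samples, String type) {
        Map<String, StringBuilder> groups = new LinkedHashMap<>();
        for (Sample s : samples) {
            groups.computeIfAbsent(s.name(), n -> new StringBuilder())
                  .append(s.name()).append(s.labels())
                  .append(' ').append(s.value()).append('\n');
        }
        StringBuilder out = new StringBuilder();
        groups.forEach((name, lines) ->
                out.append("# TYPE ").append(name).append(' ')
                   .append(type).append('\n').append(lines));
        return out.toString();
    }

    public static void main(String[] args) {
        // Samples arrive interleaved; the output is still grouped.
        System.out.print(render(List.of(
                new Sample("pulsar_example_rate_in", "{ns=\"ns1\"}", 12.0),
                new Sample("pulsar_example_storage_size", "{ns=\"ns1\"}", 1024.0),
                new Sample("pulsar_example_rate_in", "{ns=\"ns2\"}", 7.0)),
                "gauge"));
    }
}
```

A real generator would also track the type per metric (the sketch takes one shared type for brevity), and buffering per metric name holds every group in memory until flush, which is worth keeping in mind given the memory-leak discussion further down this thread.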

Verifying this change

  • Make sure that the change passes the CI checks.

This change added tests and can be verified as follows:

Added a unit test to verify that all metrics are grouped under the correct TYPE header.
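
Such a check could look roughly like the following (a hypothetical sketch, not the PR's actual test; it scans the exposition output and fails if a metric name starts a second, separate group):

```java
import java.util.HashSet;
import java.util.Set;

public class GroupingCheck {

    // Throws if any metric name starts a second, separate group in the
    // text exposition output. # HELP / # TYPE comment lines are skipped
    // for simplicity; they do not carry samples.
    static void assertGrouped(String metricsText) {
        Set<String> startedGroups = new HashSet<>();
        String current = null;
        for (String line : metricsText.split("\n")) {
            if (line.isBlank() || line.startsWith("#")) {
                continue;
            }
            // The metric name ends at the first '{' or space.
            String name = line.split("[ {]", 2)[0];
            if (!name.equals(current)) {
                if (!startedGroups.add(name)) {
                    throw new AssertionError(
                            "metric " + name + " appears in two separate groups");
                }
                current = name;
            }
        }
    }
}
```

A real test would also need to handle histogram and summary families, whose _bucket, _sum, and _count series share one TYPE header.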

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)
  • The rest endpoints: (no)
  • The admin cli options: (no)
  • Anything that affects deployment: (no)

Documentation

Need to update docs?

  • doc-not-needed
  The changes only align the output with the Prometheus spec.

@marksilcox
Contributor Author

/pulsarbot run-failure-checks

codelipenghui merged commit bc69cfb into apache:branch-2.9 on Sep 20, 2022
codelipenghui added a commit that referenced this pull request Sep 24, 2022
@codelipenghui
Contributor

@marksilcox Sorry, I have to revert this PR since it introduced a memory leak that was detected by a long-running continuous verification test.

The reproduction steps:

  1. Start a standalone
  2. Create a partitioned topic (100 partitions)
  3. Start a producer: bin/pulsar-perf produce test -r 100 -bm 1 -mk random
  4. Start a consumer: bin/pulsar-perf consume -st Key_Shared -r 100 -n 50 test
  5. Access the metrics endpoint repeatedly:

```sh
for i in {1..1000}
do
  curl -L localhost:8080/metrics/
done
```

[screenshot: memory usage before the revert]

After reverting this PR, the memory leak issue has been fixed.

[screenshot: memory usage after the revert]

I'm not sure where the root cause is yet. Since many people build on top of the Pulsar release branches, we'd better revert first, then fix the memory issue and open a new PR. @marksilcox If you need help with the memory leak issue, I believe @tjiuming or @asafm can provide some insight here.

@codelipenghui
Contributor

codelipenghui commented Sep 26, 2022

I can't reproduce the issue on the master branch with PR #15558; it looks like the memory leak only happened on branch-2.9.

But it's better to have someone double-check.

@asafm
Contributor

asafm commented Sep 29, 2022

@codelipenghui The implementations in 2.9 and master are indeed different, which is why the memory leak was introduced only on 2.9; I checked it.

