Skip to content

admin: Add support for prometheus summary metrics#30479

Merged
jmarantz merged 36 commits intoenvoyproxy:mainfrom
andybradshaw:ab/add-prometheus-summary-metrics
Apr 1, 2024
Merged

admin: Add support for prometheus summary metrics#30479
jmarantz merged 36 commits intoenvoyproxy:mainfrom
andybradshaw:ab/add-prometheus-summary-metrics

Conversation

@andybradshaw
Copy link
Copy Markdown
Contributor

Commit Message: Add support for prometheus summary metrics on the admin endpoint
Additional Description: Adds support emitting prometheus "summary" metrics for the internal histogram quantiles by supplying a query parameter. Multiple modes are supported, as in #25812, and can be either histogram, summary, or histogram,summary.
Risk Level: Low, no changes to existing default behavior
Testing: Added unit tests for histogram, summary, and summary+histogram emission
Docs Changes: Added documentation to the admin home page, and to the published admin docs around an optional query parameter.
Release Notes: Added a note in the small_feature section.

Fixes #30471

@repokitteh-read-only
Copy link
Copy Markdown

Hi @andybradshaw, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #30479 was opened by andybradshaw.

see: more, trace.

Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
@andybradshaw andybradshaw force-pushed the ab/add-prometheus-summary-metrics branch from 39b4ab3 to 664eb0d Compare October 25, 2023 03:19
Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait

Comment thread source/server/admin/prometheus_stats.cc Outdated
Comment thread docs/root/operations/admin.rst Outdated
@ggreenway ggreenway self-assigned this Oct 25, 2023
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Comment thread source/server/admin/admin.cc Outdated
Comment thread source/server/admin/stats_params.h Outdated
@jmarantz
Copy link
Copy Markdown
Contributor

/wait

…ermine summary emission

Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Comment thread changelogs/current.yaml Outdated
are reported. Supported modes are ``histogram`` and ``summary``, referring to the corresponding prometheus metric
types.
The ``/stats/prometheus`` endpoint can now emit prometheus ``summary`` metric types by explicitly setting the
``histogram_buckets`` query parameter to ``none``.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would vote for renaming "none" to "summary" in this PR, leaving 'none' as a synonym for "summary".

Comment thread docs/root/operations/admin.rst Outdated
of the text readout stat changes, which could create an unbounded number of time series.

.. http:get:: /stats?format=prometheus&histogram_emit_mode=histogram,summary
.. http:get:: /stats?format=prometheus&histogram_buckets=none
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For text and json, all 4 histogram_buckets modes are supported, and currently for prom, that is ignored, and in this PR you are adding support for 'summary'.

Right now you don't have 'detailed' or 'cumulative' supported for Prom and I'd consider that tech-debt; we don't have to solve it in this PR but we should leave TODOs. Actually TBH I'm not sure why 'cumulative' is ever useful but someone wanted it in the past and added it.

And we should document exactly what happens if the user specifies a not-yet-supported option like prometheus/detailed or prometheus/cumulative.

WDYT?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prometheus exposition format defines exactly how histograms are expressed. The only configurable/changeable behavior is the published buckets. No other options should be supported on prometheus histograms. (With the exception of switching to summaries instead of histograms; that is also well-defined in the exposition format)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, the prometheus format lines up pretty well with cumulative. I believe the reason is so that arbitrary buckets can be dropped to down-sample and save space, without having to change any of the non-dropped bucket values (in the TSDB)

Comment thread docs/root/operations/admin.rst
Comment thread source/server/admin/prometheus_stats.cc Outdated
Comment thread source/server/admin/stats_params.cc Outdated
Comment thread test/server/admin/prometheus_stats_test.cc Outdated
@jmarantz
Copy link
Copy Markdown
Contributor

needs main merge also

/wait

Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Comment thread source/server/admin/prometheus_stats.cc Outdated
for (size_t i = 0; i < supported_quantiles.size(); ++i) {
double quantile = supported_quantiles[i];
double value = computed_quantiles[i];
output.append(fmt::format("{0}{{{1}quantile=\"{2}\"}} {3}\n", prefixed_tag_extracted_name,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the default formatting for a double is for {3} here. Do you think some specific precision should be specified?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's probably a good idea, although I'm a little confused by this comment about fixed-point not being supported, since it seems pretty well documented in the fmt library? Not sure what a sane default precision should be... first thought was to use .32g like in the bucket output above.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, not sure about the fmtlib f format vs that comment. I'm pretty sure I wrote the envoy comment. But I don't recall the situation. Probably either that I completely missed f in the docs, or it didn't work in the way we needed (maybe not enough significant digits or something?).

I'd say for now use the existing .32g format, and someone should circle back in a separate change and see if we can switch them all to something better or not.

Comment thread source/server/admin/prometheus_stats.cc Outdated
const std::string tags = PrometheusStatsFormatter::formattedTags(histogram.tags());
const std::string hist_tags = histogram.tags().empty() ? EMPTY_STRING : (tags + ",");

const Stats::HistogramStatistics& stats = histogram.cumulativeStatistics();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that a summary can't be aggregated in the same way a histogram can, would it make more sense to use intervalStatistics() here? https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#summary suggests that the last 5-10 minutes of data are preferable; we don't have any way to get that time window though with the collected data, at least not as a rolling window.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a bit unfortunate, as I would imagine getting a rolling window would be quite a bit more work...

Another approach could be to use tags to emit both the interval and cumulative summaries as the quantileSummary function does?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if prometheus (the server/scraper) would ingest everything properly with two sets of quantiles, only different in tags. I'm not sure what the right thing to do here is. Agreed that getting the rolling window would be a lot more work.

Comment thread source/server/admin/prometheus_stats.cc Outdated
double quantile = supported_quantiles[i];
double value = computed_quantiles[i];
output.append(fmt::format("{0}{{{1}quantile=\"{2}\"}} {3}\n", prefixed_tag_extracted_name,
hist_tags, quantile, std::isnan(value) ? 0 : value));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec says that NaN is an allowed value (https://prometheus.io/docs/instrumenting/exposition_formats/). Should we leave it as-is here instead of translating it to 0?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to emit nan.

Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
@ravenblackx ravenblackx requested a review from ggreenway February 9, 2024 15:58
Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this is looking pretty good. One additional doc improvement.

@jmarantz do you want to take another look also?

Comment thread docs/root/operations/admin.rst Outdated

Optional ``histogram_buckets`` query parameter is used to control how histogram metrics get reported.
If unset, histograms get reported as the "histogram" prometheus metric type, but can also be used to
emit prometheus "summary" metrics if set to ``summary``.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth documenting that each emitted summary is over the interval of the last stats flush interval (and link to stats_flush_interval docs).

@jmarantz
Copy link
Copy Markdown
Contributor

lgtm still; and still needs main merge.

/wait

@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions Bot added the stale stalebot believes this issue/PR has not been touched recently label Mar 22, 2024
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
@github-actions github-actions Bot removed the stale stalebot believes this issue/PR has not been touched recently label Mar 25, 2024
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
@jmarantz jmarantz enabled auto-merge (squash) March 28, 2024 21:24
@jmarantz jmarantz dismissed ggreenway’s stale review April 1, 2024 13:31

request was addressed

@jmarantz
Copy link
Copy Markdown
Contributor

jmarantz commented Apr 1, 2024

/retest

@jmarantz
Copy link
Copy Markdown
Contributor

jmarantz commented Apr 1, 2024

@andybradshaw CI is being a little flaky I think. It says 'presubmit' failed but I couldn't find an actual error.

Can you merge main to restart it?

Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
@jmarantz jmarantz merged commit 014a68f into envoyproxy:main Apr 1, 2024
alyssawilk pushed a commit to alyssawilk/envoy that referenced this pull request Apr 29, 2024
Commit Message: Add support for prometheus summary metrics on the admin endpoint
Additional Description: Adds support emitting prometheus "summary" metrics for the internal histogram quantiles by supplying a query parameter. Multiple modes are supported, as in envoyproxy#25812, and can be either histogram, summary, or histogram,summary.
Risk Level: Low, no changes to existing default behavior
Testing: Added unit tests for histogram, summary, and summary+histogram emission
Docs Changes: Added documentation to the admin home page, and to the published admin docs around an optional query parameter.
Release Notes: Added a note in the small_feature section.

Fixes envoyproxy#30471

Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support prometheus summary metric types on the admin endpoint

4 participants