admin: Add support for prometheus summary metrics #30479
jmarantz merged 36 commits into envoyproxy:main
Hi @andybradshaw, welcome and thank you for your contribution. We will try to review your Pull Request as quickly as possible. In the meantime, please take a look at the contribution guidelines if you have not done so already.
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
39b4ab3 to 664eb0d
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
/wait
…ermine summary emission Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
| are reported. Supported modes are ``histogram`` and ``summary``, referring to the corresponding prometheus metric | ||
| types. | ||
| The ``/stats/prometheus`` endpoint can now emit prometheus ``summary`` metric types by explicitly setting the | ||
| ``histogram_buckets`` query parameter to ``none``. |
I would vote for renaming "none" to "summary" in this PR, leaving 'none' as a synonym for "summary".
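A minimal sketch of the synonym handling suggested here, assuming the parameter value arrives as a plain string. The enum and function names are hypothetical and are not taken from the Envoy source; how unrecognised values such as `detailed` or `cumulative` should be treated for prometheus output is left to the caller, since the thread does not settle it.

```cpp
#include <optional>
#include <string_view>

// Hypothetical enum for this sketch; Envoy's real types and names may differ.
enum class PromHistogramMode { Histogram, Summary };

// Maps the value of the histogram_buckets query parameter to an output mode.
// An absent (empty) value keeps the default histogram behaviour.
// Returns std::nullopt for any value this sketch does not recognise.
std::optional<PromHistogramMode> parsePromHistogramMode(std::string_view value) {
  if (value.empty()) {
    return PromHistogramMode::Histogram;
  }
  // "none" is kept as a synonym for "summary", as proposed in the review.
  if (value == "summary" || value == "none") {
    return PromHistogramMode::Summary;
  }
  return std::nullopt;
}
```

Returning `std::nullopt` rather than silently falling back makes it easy for the caller to either reject the request or document the fallback explicitly.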
| of the text readout stat changes, which could create an unbounded number of time series. | ||
|
|
||
| .. http:get:: /stats?format=prometheus&histogram_emit_mode=histogram,summary | ||
| .. http:get:: /stats?format=prometheus&histogram_buckets=none |
For text and json, all 4 histogram_buckets modes are supported, and currently for prom, that is ignored, and in this PR you are adding support for 'summary'.
Right now you don't have 'detailed' or 'cumulative' supported for Prom and I'd consider that tech-debt; we don't have to solve it in this PR but we should leave TODOs. Actually TBH I'm not sure why 'cumulative' is ever useful but someone wanted it in the past and added it.
And we should document exactly what happens if the user specifies a not-yet-supported option like prometheus/detailed or prometheus/cumulative.
WDYT?
The prometheus exposition format defines exactly how histograms are expressed. The only configurable/changeable behavior is the published buckets. No other options should be supported on prometheus histograms. (With the exception of switching to summaries instead of histograms; that is also well-defined in the exposition format)
FWIW, the prometheus format lines up pretty well with cumulative. I believe the reason is so that arbitrary buckets can be dropped to down-sample and save space, without having to change any of the non-dropped bucket values (in the TSDB)
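The down-sampling argument can be illustrated with a small sketch. The helper names below are hypothetical and exist only for this example; they are not Envoy or Prometheus APIs.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts disjoint (per-bucket) counts into cumulative counts, the shape
// Prometheus uses for its `le` buckets: entry i counts every sample that
// falls at or below upper bound i.
std::vector<std::uint64_t> toCumulative(const std::vector<std::uint64_t>& disjoint) {
  std::vector<std::uint64_t> cumulative;
  cumulative.reserve(disjoint.size());
  std::uint64_t running = 0;
  for (std::uint64_t count : disjoint) {
    running += count;
    cumulative.push_back(running);
  }
  return cumulative;
}

// Drops the bucket at `index` (which must be < cumulative.size()).
// The remaining cumulative values stay valid without any adjustment.
std::vector<std::uint64_t> dropBucket(std::vector<std::uint64_t> cumulative, std::size_t index) {
  cumulative.erase(cumulative.begin() + static_cast<std::ptrdiff_t>(index));
  return cumulative;
}
```

With disjoint counts `{3, 2, 5}` the cumulative form is `{3, 5, 10}`; dropping the middle bucket leaves `{3, 10}`, which is still correct. In the disjoint form the dropped bucket's count would have to be folded into its neighbour.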
needs main merge also /wait
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
| for (size_t i = 0; i < supported_quantiles.size(); ++i) { | ||
| double quantile = supported_quantiles[i]; | ||
| double value = computed_quantiles[i]; | ||
| output.append(fmt::format("{0}{{{1}quantile=\"{2}\"}} {3}\n", prefixed_tag_extracted_name, |
I'm not sure what the default formatting for a double is for {3} here. Do you think some specific precision should be specified?
Yeah, that's probably a good idea, although I'm a little confused by this comment about fixed-point not being supported, since it seems pretty well documented in the fmt library? Not sure what a sane default precision should be... first thought was to use .32g like in the bucket output above.
Honestly, not sure about the fmtlib f format vs that comment. I'm pretty sure I wrote the envoy comment. But I don't recall the situation. Probably either that I completely missed f in the docs, or it didn't work in the way we needed (maybe not enough significant digits or something?).
I'd say for now use the existing .32g format, and someone should circle back in a separate change and see if we can switch them all to something better or not.
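For reference, a sketch of what a `.32g`-style precision does to a double. To stay self-contained it uses `std::snprintf` with `%.32g` in place of `fmt::format`, on the assumption that the two behave alike for this specifier; `formatMetricValue` is a hypothetical helper, not a function in the PR.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Formats a value with up to 32 significant digits. The %g conversion drops
// trailing zeros, so exactly representable values print compactly.
std::string formatMetricValue(double value) {
  char buffer[64]; // comfortably larger than any %.32g rendering of a double
  const int written = std::snprintf(buffer, sizeof(buffer), "%.32g", value);
  if (written < 0) {
    return {}; // encoding error; return an empty string in this sketch
  }
  return std::string(buffer, static_cast<std::size_t>(written));
}
```

Values such as `0.5` or `250` come out as `0.5` and `250`, while values that are not exactly representable in binary (for example `0.1`) expose their full binary expansion, which is one reason a later pass over all the format strings may be worthwhile.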
| const std::string tags = PrometheusStatsFormatter::formattedTags(histogram.tags()); | ||
| const std::string hist_tags = histogram.tags().empty() ? EMPTY_STRING : (tags + ","); | ||
|
|
||
| const Stats::HistogramStatistics& stats = histogram.cumulativeStatistics(); |
Given that a summary can't be aggregated in the same way a histogram can, would it make more sense to use intervalStatistics() here? https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#summary suggests that the last 5-10 minutes of data are preferable; we don't have any way to get that time window though with the collected data, at least not as a rolling window.
Yeah, this is a bit unfortunate, as I would imagine getting a rolling window would be quite a bit more work...
Another approach could be to use tags to emit both the interval and cumulative summaries as the quantileSummary function does?
I'm not sure if prometheus (the server/scraper) would ingest everything properly with two sets of quantiles, only different in tags. I'm not sure what the right thing to do here is. Agreed that getting the rolling window would be a lot more work.
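To make the interval versus cumulative distinction concrete, here is a deliberately simplified model. It is an assumption made for illustration only: `TwoViewHistogram` is a hypothetical type that stores raw samples, and Envoy's real histogram implementation is not structured this way.

```cpp
#include <vector>

// Hypothetical stand-in for a histogram that exposes two views of its data.
struct TwoViewHistogram {
  std::vector<double> pending;    // samples recorded since the last flush
  std::vector<double> interval;   // samples from the most recent flush only
  std::vector<double> cumulative; // every sample seen up to the last flush

  void record(double sample) { pending.push_back(sample); }

  // On flush, the pending samples replace the interval view and are
  // appended to the cumulative view.
  void flush() {
    interval = pending;
    cumulative.insert(cumulative.end(), pending.begin(), pending.end());
    pending.clear();
  }
};
```

Quantiles computed from `interval` describe only the latest flush period, which is closer to the recent-window behaviour the OpenMetrics text prefers for summaries, while quantiles computed from `cumulative` are increasingly dominated by old data. Neither view is a true rolling window.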
| double quantile = supported_quantiles[i]; | ||
| double value = computed_quantiles[i]; | ||
| output.append(fmt::format("{0}{{{1}quantile=\"{2}\"}} {3}\n", prefixed_tag_extracted_name, | ||
| hist_tags, quantile, std::isnan(value) ? 0 : value)); |
The spec says that NaN is an allowed value (https://prometheus.io/docs/instrumenting/exposition_formats/). Should we leave it as-is here instead of translating it to 0?
Updated to emit nan.
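A sketch of a quantile line that passes NaN through instead of mapping it to 0. It is not the PR's code: `quantileLine` is a hypothetical helper, `std::snprintf` stands in for `fmt::format`, and the literal spelling `NaN` is chosen here because the Prometheus text exposition format documents it as an allowed value.

```cpp
#include <cmath>
#include <cstdio>
#include <string>

// Builds one summary sample line of the form
//   name{<tags>quantile="<q>"} <value>\n
// `tags` is expected to be empty or to end with a comma, mirroring hist_tags
// in the diff above.
std::string quantileLine(const std::string& name, const std::string& tags, double quantile,
                         double value) {
  char q[32];
  std::snprintf(q, sizeof(q), "%g", quantile);
  std::string out = name + "{" + tags + "quantile=\"" + q + "\"} ";
  if (std::isnan(value)) {
    out += "NaN"; // emitted as-is rather than translated to 0
  } else {
    char v[64];
    std::snprintf(v, sizeof(v), "%.32g", value);
    out += v;
  }
  out += "\n";
  return out;
}
```

Emitting NaN keeps "no observations in this window" distinguishable from a real quantile value of 0.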
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
|
|
||
| Optional ``histogram_buckets`` query parameter is used to control how histogram metrics get reported. | ||
| If unset, histograms get reported as the "histogram" prometheus metric type, but can also be used to | ||
| emit prometheus "summary" metrics if set to ``summary``. |
I think it's worth documenting that each emitted summary covers only the most recent stats flush interval (and linking to the stats_flush_interval docs).
lgtm still; and still needs main merge. /wait
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
/retest
@andybradshaw CI is being a little flaky I think. It says 'presubmit' failed but I couldn't find an actual error. Can you merge main to restart it?
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>
Commit Message: Add support for prometheus summary metrics on the admin endpoint
Additional Description: Adds support for emitting prometheus "summary" metrics for the internal histogram quantiles by supplying a query parameter. Multiple modes are supported, as in #25812, and can be either `histogram`, `summary`, or `histogram,summary`.
Risk Level: Low, no changes to existing default behavior
Testing: Added unit tests for histogram, summary, and summary+histogram emission
Docs Changes: Added documentation to the admin home page, and to the published admin docs around an optional query parameter.
Release Notes: Added a note in the `small_feature` section.
Fixes #30471
Signed-off-by: Andy Bradshaw <abradshaw@palantir.com>