Allow `synapse_http_server_response_time_seconds` Grafana histogram quantiles to show values bigger than 10s (#13478)

MadLittleMods wants to merge 6 commits.

Conversation
```
0.005,
0.01,
0.025,
0.05,
0.075,
0.1,
0.25,
0.5,
0.75,
1.0,
2.5,
5.0,
7.5,
10.0,
```
This section matches the default buckets: https://github.com/prometheus/client_python/blob/5a5261dd45d65914b5e3d8225b94d6e0578882f3/prometheus_client/metrics.py#L544 (0.005 - 10.0)
I chose the default as a base because that is what it was using before. Do we want to tune these or eliminate any to reduce cardinality?
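For context, the default bucket list and an extended one can be compared directly. In this sketch, only 120.0 and 180.0 are taken from the diff below; the intermediate values (20–60s) are illustrative assumptions, not the exact set the PR adds.

```python
# Default prometheus_client buckets (0.005 - 10.0, before +Inf is appended).
DEFAULT_BUCKETS = [
    0.005, 0.01, 0.025, 0.05, 0.075, 0.1,
    0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0,
]

# Hypothetical extension: 120.0 and 180.0 appear in the diff,
# the intermediate values are assumptions for illustration.
EXTRA_BUCKETS = [20.0, 30.0, 60.0, 120.0, 180.0]

extended = DEFAULT_BUCKETS + EXTRA_BUCKETS
growth = len(EXTRA_BUCKETS) / len(DEFAULT_BUCKETS)
print(f"{len(DEFAULT_BUCKETS)} -> {len(extended)} buckets (+{growth:.0%})")
```

Five extra buckets on top of fourteen is roughly the "~30% more" mentioned below, which is why each addition needs to pull its weight.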
Adding ~30% more buckets seems like a step in the wrong direction for #11082.
I've just noticed this comment. Perhaps we could drop the 0.075, 0.75 and 7.5 buckets? Then the remaining ones would be separated by roughly factors of two.
We'd still be growing the number of buckets by 2 in that case though. If we wanted to avoid growing the cardinality we'd have to pick 2 more to drop.
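A quick check (my own sketch, not from the PR) confirms that dropping 0.075, 0.75, and 7.5 leaves adjacent default buckets separated by factors of 2 to 2.5:

```python
# Default buckets with 0.075, 0.75, and 7.5 removed.
buckets = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]

# Ratio between each pair of adjacent bucket boundaries.
ratios = [hi / lo for lo, hi in zip(buckets, buckets[1:])]
print(ratios)  # every step is a factor of ~2 or ~2.5
```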
We could drop 200.0 as well, since anything above 180s has probably hit the timeout anyway.
I'm not sure if we really want this. There have been complaints in the past about the cardinality of these buckets, and adding ~30% more buckets seems like a step in the wrong direction for #11082. Is there a particular insight we're hoping to gain by raising the 10 second cap?
```
@@ -43,6 +43,28 @@
    "synapse_http_server_response_time_seconds",
```
> Is there a particular insight we're hoping to gain by raising the 10 second cap?
I'm trying to optimize the slow `/messages` requests (#13356), specifically those that take more than 10s.
In order to track progress there, I'd like the metrics to capture them.
Hang on two secs, I'm a bit concerned by removing the *75 buckets in terms of losing definition in the common cases.
erikjohnston left a comment:
c.f. comment about losing fidelity
```
    120.0,
    180.0,
    "+Inf",
),
```
Sorry for not jumping in on this sooner, but: the vast majority of our APIs respond within the range of 0–10s, so losing fidelity there reduces our insight into response times. There is quite a big difference between APIs that return in 500ms and those that return in 1s, and removing the 750ms bucket means we can't easily differentiate.

Since this is a thing we're adding specifically to measure progress in performance improvements for a particular API, I'm very tempted to suggest that we simply create a separate metric for the `/messages` API. This would also allow us to differentiate between local and remote `/messages` requests, for example.
I can create a separate PR to add a specific metric for /messages -> #13533
But do we have any interest in adjusting the buckets for the general case? @erikjohnston mentioned maybe wanting even more fidelity in the lower ranges, if anything. @richvdh do you have any interest in increasing fidelity for another endpoint? Our limiting factor is cardinality, since these buckets multiply out across all of our servlets.
> I think @MadLittleMods is right in that the top bucket should be more than 10s given how often some of our endpoints take longer than that
>
> -- @richvdh, https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$CLJ5oioD_DO1A_zSGmYtCd-yToSyA6EiOwOsClvfdcs?via=matrix.org&via=element.io&via=beeper.com
In terms of reducing cardinality, we could remove the `code` label. For timing, I think we really just need the method and servlet name. The response code can be useful, but maybe we just need to change it to a `successful_response` boolean (with a cardinality of 2, `true`/`false`), since we only ever use it as `code=~"2.."`. Or maybe more useful as `error_response: true/false`, so that a success or timeout can still be `false` while an actual error would be `true`.
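As a rough sketch of the saving (all label counts here are made-up illustrative numbers, not measured from a real Synapse deployment), collapsing the status-code label into a boolean shrinks the per-histogram series count by the ratio of distinct codes to 2:

```python
# Hypothetical label counts, for illustration only.
servlets = 100          # distinct servlet names
methods = 4             # GET/POST/PUT/DELETE
status_codes = 15       # distinct HTTP codes actually emitted
buckets = 20            # histogram buckets incl. +Inf

with_code = servlets * methods * status_codes * buckets
with_bool = servlets * methods * 2 * buckets
print(with_code, with_bool, with_code / with_bool)
```

With these example numbers the series count drops by a factor of 7.5, which would more than pay for a handful of extra buckets.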
Allow `synapse_http_server_response_time_seconds` Grafana histogram quantiles to show values bigger than 10s. Part of #13356.
Before

Purple line: the >99% percentile has a false max ceiling of 10s because the bucket values don't go above 10.

https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&var-datasource=default&var-bucket_size=%24__auto_interval_bucket_size&var-instance=matrix.org&var-job=synapse_client_reader&var-index=All&from=1660039325520&to=1660060925520&viewPanel=152
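The ceiling comes from how PromQL's `histogram_quantile` behaves when the requested quantile lands in the `+Inf` bucket: it returns the upper bound of the highest finite bucket. A minimal Python re-implementation of that rule (a sketch, not Prometheus's actual code) shows the effect:

```python
import math

def histogram_quantile(q, buckets):
    """buckets: [(upper_bound, cumulative_count), ...] ending with (+Inf, total)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: clamp to the highest
                # finite upper bound -- the "false ceiling".
                return prev_bound
            # Linear interpolation within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# 5 of 100 requests took longer than 10s, but p99 is still reported as 10.0.
obs = [(0.5, 50), (1.0, 80), (10.0, 95), (float("inf"), 100)]
print(histogram_quantile(0.99, obs))  # 10.0
```

No matter how slow the tail requests are, any quantile above the 95th reports exactly 10.0 here, which is why adding buckets above 10s is the only way to see the tail.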

After
I don't know if this actually fixes it (haven't tested).
Dev notes
Docs:
https://github.com/prometheus/client_python/blob/5a5261dd45d65914b5e3d8225b94d6e0578882f3/prometheus_client/metrics.py#L544
`synapse_http_server_response_time_seconds_bucket`
`synapse_http_server_response_time_seconds_sum`
`synapse_http_server_response_time_seconds_count`

Pull Request Checklist
- Changelog entry (e.g. "Moved `EventStore` to `EventWorkerStore`."), using markdown for code blocks
- Pull request includes a sign off
- Code style is correct (run the linters)