Allow `synapse_http_server_response_time_seconds` Grafana histogram quantiles to show values bigger than 10s (#13478)

MadLittleMods wants to merge 6 commits.

Conversation
```
0.005,
0.01,
0.025,
0.05,
0.075,
0.1,
0.25,
0.5,
0.75,
1.0,
2.5,
5.0,
7.5,
10.0,
```
This section matches the default buckets: https://github.com/prometheus/client_python/blob/5a5261dd45d65914b5e3d8225b94d6e0578882f3/prometheus_client/metrics.py#L544 (0.005 - 10.0)
I chose the default as a base because that is what it was using before. Do we want to tune these or eliminate any to reduce cardinality?
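For context, the default bucket list and an extended one can be compared directly. In this sketch, only 120.0 and 180.0 are taken from the diff below; the intermediate values (20–60s) are illustrative assumptions, not the exact set the PR adds.

```python
# Default prometheus_client buckets (0.005 - 10.0, before +Inf is appended).
DEFAULT_BUCKETS = [
    0.005, 0.01, 0.025, 0.05, 0.075, 0.1,
    0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0,
]

# Hypothetical extension: 120.0 and 180.0 appear in the diff,
# the intermediate values are assumptions for illustration.
EXTRA_BUCKETS = [20.0, 30.0, 60.0, 120.0, 180.0]

extended = DEFAULT_BUCKETS + EXTRA_BUCKETS
growth = len(EXTRA_BUCKETS) / len(DEFAULT_BUCKETS)
print(f"{len(DEFAULT_BUCKETS)} -> {len(extended)} buckets (+{growth:.0%})")
```

Five extra buckets on top of fourteen is roughly the "~30% more" mentioned below, which is why each addition needs to pull its weight.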
Adding ~30% more buckets seems like a step in the wrong direction for #11082.
I've just noticed this comment. Perhaps we could drop the 0.075, 0.75 and 7.5 buckets? Then the remaining ones would be separated by roughly factors of two.
We'd still be growing the number of buckets by 2 in that case though. If we wanted to avoid growing the cardinality we'd have to pick 2 more to drop.
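A quick check (my own sketch, not from the PR) confirms that dropping 0.075, 0.75, and 7.5 leaves adjacent default buckets separated by factors of 2 to 2.5:

```python
# Default buckets with 0.075, 0.75, and 7.5 removed.
buckets = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]

# Ratio between each pair of adjacent bucket boundaries.
ratios = [hi / lo for lo, hi in zip(buckets, buckets[1:])]
print(ratios)  # every step is a factor of ~2 or ~2.5
```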
We could drop 200.0 as well, since anything above 180s has probably hit the timeout anyway.
I'm not sure if we really want this. There have been complaints in the past about the cardinality of these buckets, and adding ~30% more buckets seems like a step in the wrong direction for #11082. Is there a particular insight we're hoping to gain by raising the 10 second cap?
```
@@ -43,6 +43,28 @@
    "synapse_http_server_response_time_seconds",
```
> Is there a particular insight we're hoping to gain by raising the 10 second cap?
I'm trying to optimize the slow `/messages` requests (#13356), specifically those that take more than 10s.
In order to track progress there, I'd like the metrics to capture them.
Hang on two secs, I'm a bit concerned by removing the *75 buckets in terms of losing definition in the common cases.
erikjohnston left a comment:
c.f. comment about losing fidelity
```
    120.0,
    180.0,
    "+Inf",
),
```
Sorry for not jumping in on this sooner, but: the vast majority of our APIs respond within the range of 0–10s, so losing fidelity there reduces our insight into response times. There is quite a big difference between APIs that return in 500ms and those that return in 1s, and removing the 750ms bucket means we can't easily differentiate.

Since this is a thing we're adding specifically to measure progress in performance improvements for a particular API, I'm very tempted to suggest that we simply create a separate metric for the `/messages` API. This would also allow us to differentiate between local and remote `/messages` requests, for example.
I can create a separate PR to add a specific metric for /messages -> #13533
But do we have any interest in adjusting the buckets for the general case? @erikjohnston mentioned maybe wanting even more fidelity in the lower ranges, if anything. @richvdh do you have any interest in increasing fidelity for another endpoint? Our limiting factor is cardinality, since these buckets multiply out across all of our servlets.
> I think @MadLittleMods is right in that the top bucket should be more than 10s given how often some of our endpoints take longer than that
>
> -- @richvdh, https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$CLJ5oioD_DO1A_zSGmYtCd-yToSyA6EiOwOsClvfdcs?via=matrix.org&via=element.io&via=beeper.com
In terms of reducing cardinality, we could remove the `code` label. For timing, I think we really just need the method and servlet name. The response code can be useful, but maybe we just need to change it to a `successful_response` boolean (with a cardinality of 2, `true`/`false`), since we only ever use it as `code=~"2.."`. Or maybe more useful as `error_response: true/false`, so that a success or timeout can still be `false` while an actual error would be `true`.
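As a rough sketch of the saving (all label counts here are made-up illustrative numbers, not measured from a real Synapse deployment), collapsing the status-code label into a boolean shrinks the per-histogram series count by the ratio of distinct codes to 2:

```python
# Hypothetical label counts, for illustration only.
servlets = 100          # distinct servlet names
methods = 4             # GET/POST/PUT/DELETE
status_codes = 15       # distinct HTTP codes actually emitted
buckets = 20            # histogram buckets incl. +Inf

with_code = servlets * methods * status_codes * buckets
with_bool = servlets * methods * 2 * buckets
print(with_code, with_bool, with_code / with_bool)
```

With these example numbers the series count drops by a factor of 7.5, which would more than pay for a handful of extra buckets.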
Allow `synapse_http_server_response_time_seconds` Grafana histogram quantiles to show values bigger than 10s. Part of #13356.
Before

Purple line: the >99% percentile has a false max ceiling of 10s because the bucket values don't go above 10.

https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&var-datasource=default&var-bucket_size=%24__auto_interval_bucket_size&var-instance=matrix.org&var-job=synapse_client_reader&var-index=All&from=1660039325520&to=1660060925520&viewPanel=152
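The ceiling comes from how PromQL's `histogram_quantile` behaves when the requested quantile lands in the `+Inf` bucket: it returns the upper bound of the highest finite bucket. A minimal Python re-implementation of that rule (a sketch, not Prometheus's actual code) shows the effect:

```python
import math

def histogram_quantile(q, buckets):
    """buckets: [(upper_bound, cumulative_count), ...] ending with (+Inf, total)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: clamp to the highest
                # finite upper bound -- the "false ceiling".
                return prev_bound
            # Linear interpolation within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# 5 of 100 requests took longer than 10s, but p99 is still reported as 10.0.
obs = [(0.5, 50), (1.0, 80), (10.0, 95), (float("inf"), 100)]
print(histogram_quantile(0.99, obs))  # 10.0
```

No matter how slow the tail requests are, any quantile above the 95th reports exactly 10.0 here, which is why adding buckets above 10s is the only way to see the tail.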

After
I don't know if this actually fixes it (haven't tested).
Dev notes
Docs:
https://github.com/prometheus/client_python/blob/5a5261dd45d65914b5e3d8225b94d6e0578882f3/prometheus_client/metrics.py#L544
`synapse_http_server_response_time_seconds_bucket`
`synapse_http_server_response_time_seconds_sum`
`synapse_http_server_response_time_seconds_count`

Pull Request Checklist
- Changelog entry (e.g. "Moved `EventStore` to `EventWorkerStore`."), using markdown for code blocks
- Pull request includes a sign off
- Code style is correct (run the linters)