Prometheus stats perf improvements take 2 by rulex123 · Pull Request #28169 · envoyproxy/envoy

rulex123 · 2023-06-28T09:20:24Z

After merging https://github.com/envoyproxy/envoy/pull/24998/files, we ran into
issues caused by the fact that there are stats across different scopes that have
the same tag extracted name. This results in the Prometheus rendering being incorrect.

This patch tries to solve that issue by capturing all stats from all scopes at the beginning
of the corresponding phase in the chinking algorithm: so effectively, the only
change in the original algorithm implemented in https://github.com/envoyproxy/envoy/pull/24998/files
is found in the startPhase function of StatsRequest, which now has a different implementation
in grouped vs. ungrouped stats request class.

There still is a perf improvement (see attached files) for Prometheus rendering, though
less marked than with the original patch. With this patch the memory footprint for grouped
stats is also higher (though I haven't done any profiling at this point).

I am leaving as a draft for now because I still haven't had time to test out on one of our Envoy
production instances to verify the stats rendering works as expected.

(i.e. doesn't cater for stats with identical tag extracted name but from different scopes). Signed-off-by: rulex123 <erica.manno@gmail.com>

but are from different scopes: in order to guarantee that they are rendered correctly, we extract all counters (or gauges etc.) from all scopes at the beginning of the corresponding phase. Signed-off-by: rulex123 <erica.manno@gmail.com>

Signed-off-by: rulex123 <erica.manno@gmail.com>

repokitteh-read-only · 2023-06-28T09:20:29Z

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #28169 was opened by rulex123.

see: more, trace.

rulex123 · 2023-06-28T09:25:10Z

From running stats_handler_speed_test on my dev machine
perf_after.txt
perf_before.txt

rulex123 · 2023-06-28T09:25:41Z

cc @jmarantz WDYT about this idea?

jmarantz · 2023-06-28T13:08:43Z

This is a big improvement over the current state. Just want to double-check the benchmark binary was compiled with -c opt but I'm guessing it was, and in any case the improvement is s=

It's still a lot slower than text or json, but taking 5s of CPU time on the main thread down to 2 for 1M stats (IIRC) is probably a meaningful improvement.

What did you think of my notes in #16139 (comment) -- if we really want to make Prom take <1s we can maintain TagExtractedName -> Set<Stat> maps, using hooks we'd need to add to allocator_impl.cc? It might be worth exploring this option, in case it takes us down a different path than this PR. I haven't looked at the code in this PR yet.

@ggreenway should also be part of this dialog.

ggreenway · 2023-06-29T16:50:27Z

As suggested by @jmarantz I think pre-computing the tag-extracted-name relationships would be substantially faster; recomputing this everytime is wasteful. But it would be more complicated, and we'd need to figure out if we want to do this in all cases (which would use additional memory) or not. And if we didn't have that enabled, would prometheus stats still work using a slower codepath? If so we'd need to maintain two separate codepaths which isn't great.

ggreenway · 2023-06-29T16:57:16Z

Also, before introducing a big change to prometheus stats again, I'd like to add an integration test that validates some of the invariants of the format (allowed character sets, sorting, no duplicate TYPE messages, etc).

jmarantz · 2023-06-29T20:48:19Z

@ggreenway in #27239 I added a new test for duplicate types in hopes to avoid a regression on this again. Is that what you were looking for? That addresses duplicate TYPE tags though not character-set issues; aren't there tests for those already?

@ggreenway when you say "do this in all cases" I'm wondering if we could structure this so that (a) we start maintaining extracted-name maps only if /stats/prometheus is called, and then (b) optionally if some amount of time has expired and /stats/prometheus has not been called, stop maintaining the set and reclaim the memory.

ggreenway · 2023-06-29T22:06:43Z

@ggreenway when you say "do this in all cases" I'm wondering if we could structure this so that (a) we start maintaining extracted-name maps only if /stats/prometheus is called, and then (b) optionally if some amount of time has expired and /stats/prometheus has not been called, stop maintaining the set and reclaim the memory.

SGTM. How hard will it be to set that up after envoy is already configured and running?

jmarantz · 2023-06-29T23:06:44Z

I think the lazy-initiation of prom map collection and expiring that after a period of no prom scrapes shouldn't be too hard. I think this points toward a design having optional hooks in the Allocator that can be set from the prom handler, and cleared from a timer established by the prom handler.

I think adding the hooks to the allocator won't be that hard either. allocator_impl.cc has good points where these could be called. We'd need to be careful with the threading model though. In theory a stat can be removed from the allocator from any thread, but maybe we only want to only update the prom maps from the main thread, and we might not want to have a delay for post(). That requires some thought. It might be the case that in practice ThreadLocalStore will only clear the central cache on the main thread so maybe this mostly would not be a problem.

I think the hard (and fun) part will be the algorithmic aspect of updating the sets in response to a stat creation or removal!

github-actions · 2023-07-30T00:02:54Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions · 2023-08-06T04:01:28Z

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

rulex123 added 3 commits June 28, 2023 09:49

Code from envoyproxy#24998 unchanged

44d0843

(i.e. doesn't cater for stats with identical tag extracted name but from different scopes). Signed-off-by: rulex123 <erica.manno@gmail.com>

Port logic for hidden stat.

87e7012

Signed-off-by: rulex123 <erica.manno@gmail.com>

github-actions Bot added the stale stalebot believes this issue/PR has not been touched recently label Jul 30, 2023

github-actions Bot closed this Aug 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prometheus stats perf improvements take 2#28169

Prometheus stats perf improvements take 2#28169
rulex123 wants to merge 3 commits intoenvoyproxy:mainfrom
rulex123:try-out

rulex123 commented Jun 28, 2023

Uh oh!

repokitteh-read-only Bot commented Jun 28, 2023

Uh oh!

rulex123 commented Jun 28, 2023

Uh oh!

rulex123 commented Jun 28, 2023

Uh oh!

jmarantz commented Jun 28, 2023

Uh oh!

ggreenway commented Jun 29, 2023

Uh oh!

ggreenway commented Jun 29, 2023

Uh oh!

jmarantz commented Jun 29, 2023 •

edited

Loading

Uh oh!

ggreenway commented Jun 29, 2023

Uh oh!

jmarantz commented Jun 29, 2023

Uh oh!

github-actions Bot commented Jul 30, 2023

Uh oh!

github-actions Bot commented Aug 6, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

rulex123 commented Jun 28, 2023

Uh oh!

repokitteh-read-only Bot commented Jun 28, 2023

Uh oh!

rulex123 commented Jun 28, 2023

Uh oh!

rulex123 commented Jun 28, 2023

Uh oh!

jmarantz commented Jun 28, 2023

Uh oh!

ggreenway commented Jun 29, 2023

Uh oh!

ggreenway commented Jun 29, 2023

Uh oh!

jmarantz commented Jun 29, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ggreenway commented Jun 29, 2023

Uh oh!

jmarantz commented Jun 29, 2023

Uh oh!

github-actions Bot commented Jul 30, 2023

Uh oh!

github-actions Bot commented Aug 6, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jmarantz commented Jun 29, 2023 •

edited

Loading