Skip to content

Prometheus stats perf improvements take 2#28169

Closed
rulex123 wants to merge 3 commits intoenvoyproxy:mainfrom
rulex123:try-out
Closed

Prometheus stats perf improvements take 2#28169
rulex123 wants to merge 3 commits intoenvoyproxy:mainfrom
rulex123:try-out

Conversation

@rulex123
Copy link
Copy Markdown
Contributor

After merging https://github.com/envoyproxy/envoy/pull/24998/files, we ran into
issues caused by the fact that there are stats across different scopes that have
the same tag extracted name. This results in the Prometheus rendering being incorrect.

This patch tries to solve that issue by capturing all stats from all scopes at the beginning
of the corresponding phase in the chinking algorithm: so effectively, the only
change in the original algorithm implemented in https://github.com/envoyproxy/envoy/pull/24998/files
is found in the startPhase function of StatsRequest, which now has a different implementation
in grouped vs. ungrouped stats request class.

There still is a perf improvement (see attached files) for Prometheus rendering, though
less marked than with the original patch. With this patch the memory footprint for grouped
stats is also higher (though I haven't done any profiling at this point).

I am leaving as a draft for now because I still haven't had time to test out on one of our Envoy
production instances to verify the stats rendering works as expected.

rulex123 added 3 commits June 28, 2023 09:49
(i.e. doesn't cater for stats with identical tag extracted name but from
different scopes).

Signed-off-by: rulex123 <erica.manno@gmail.com>
but are from different scopes: in order to guarantee that they are
rendered correctly, we extract all counters (or gauges etc.) from
all scopes at the beginning of the corresponding phase.

Signed-off-by: rulex123 <erica.manno@gmail.com>
Signed-off-by: rulex123 <erica.manno@gmail.com>
@repokitteh-read-only
Copy link
Copy Markdown

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #28169 was opened by rulex123.

see: more, trace.

@rulex123
Copy link
Copy Markdown
Contributor Author

From running stats_handler_speed_test on my dev machine
perf_after.txt
perf_before.txt

@rulex123
Copy link
Copy Markdown
Contributor Author

cc @jmarantz WDYT about this idea?

@jmarantz
Copy link
Copy Markdown
Contributor

This is a big improvement over the current state. Just want to double-check the benchmark binary was compiled with -c opt but I'm guessing it was, and in any case the improvement is s=

It's still a lot slower than text or json, but taking 5s of CPU time on the main thread down to 2 for 1M stats (IIRC) is probably a meaningful improvement.

What did you think of my notes in #16139 (comment) -- if we really want to make Prom take <1s we can maintain TagExtractedName -> Set<Stat> maps, using hooks we'd need to add to allocator_impl.cc? It might be worth exploring this option, in case it takes us down a different path than this PR. I haven't looked at the code in this PR yet.

@ggreenway should also be part of this dialog.

@ggreenway
Copy link
Copy Markdown
Member

As suggested by @jmarantz I think pre-computing the tag-extracted-name relationships would be substantially faster; recomputing this everytime is wasteful. But it would be more complicated, and we'd need to figure out if we want to do this in all cases (which would use additional memory) or not. And if we didn't have that enabled, would prometheus stats still work using a slower codepath? If so we'd need to maintain two separate codepaths which isn't great.

@ggreenway
Copy link
Copy Markdown
Member

Also, before introducing a big change to prometheus stats again, I'd like to add an integration test that validates some of the invariants of the format (allowed character sets, sorting, no duplicate TYPE messages, etc).

@jmarantz
Copy link
Copy Markdown
Contributor

jmarantz commented Jun 29, 2023

@ggreenway in #27239 I added a new test for duplicate types in hopes to avoid a regression on this again. Is that what you were looking for? That addresses duplicate TYPE tags though not character-set issues; aren't there tests for those already?

@ggreenway when you say "do this in all cases" I'm wondering if we could structure this so that (a) we start maintaining extracted-name maps only if /stats/prometheus is called, and then (b) optionally if some amount of time has expired and /stats/prometheus has not been called, stop maintaining the set and reclaim the memory.

@ggreenway
Copy link
Copy Markdown
Member

@ggreenway when you say "do this in all cases" I'm wondering if we could structure this so that (a) we start maintaining extracted-name maps only if /stats/prometheus is called, and then (b) optionally if some amount of time has expired and /stats/prometheus has not been called, stop maintaining the set and reclaim the memory.

SGTM. How hard will it be to set that up after envoy is already configured and running?

@jmarantz
Copy link
Copy Markdown
Contributor

I think the lazy-initiation of prom map collection and expiring that after a period of no prom scrapes shouldn't be too hard. I think this points toward a design having optional hooks in the Allocator that can be set from the prom handler, and cleared from a timer established by the prom handler.

I think adding the hooks to the allocator won't be that hard either. allocator_impl.cc has good points where these could be called. We'd need to be careful with the threading model though. In theory a stat can be removed from the allocator from any thread, but maybe we only want to only update the prom maps from the main thread, and we might not want to have a delay for post(). That requires some thought. It might be the case that in practice ThreadLocalStore will only clear the central cache on the main thread so maybe this mostly would not be a problem.

I think the hard (and fun) part will be the algorithmic aspect of updating the sets in response to a stat creation or removal!

@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions Bot added the stale stalebot believes this issue/PR has not been touched recently label Jul 30, 2023
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Aug 6, 2023

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions Bot closed this Aug 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stale stalebot believes this issue/PR has not been touched recently

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants