Skip to content

stats: metric/tag name fix for Prometheus#2303

Merged
mattklein123 merged 5 commits intoenvoyproxy:masterfrom
srvaroa:g
Jan 4, 2018
Merged

stats: metric/tag name fix for Prometheus#2303
mattklein123 merged 5 commits intoenvoyproxy:masterfrom
srvaroa:g

Conversation

@srvaroa
Copy link
Copy Markdown
Contributor

@srvaroa srvaroa commented Jan 3, 2018

Description
The Prometheus metrics formatter only sanitizes '.', but leaves other
non supported chars like '-'. A case where this may appear is on cipher
suites as reported in #2263.

The change adds a second replace for '-' (rather than a regex to avoid
the perf. hit, same reason for not matching on all [^a-Z0-9_] chars.)
The replacement is only applied to metric and tag names, not tag
values.

As a result of this change, the current formatting would change from:

# TYPE envoy_cluster_ssl_ciphers_ECDHE-RSA-AES128-GCM-SHA256 counter
envoy_cluster_ssl_ciphers_ECDHE-RSA-AES128-GCM-SHA256{envoy_cluster_name="local_service"} 8

to

# TYPE envoy_cluster_ssl_ciphers_ECDHE-RSA-AES128-GCM-SHA256 counter
envoy_cluster_ssl_ciphers_ECDHE_RSA_AES128_GCM_SHA256{envoy_cluster_name="local_service"} 8

Fixing the general case for any metric/tag name that includes '-'.

Following the suggestion on the original issue, the cipher suite is also moved to a label instead. The specific example above results in:

# TYPE envoy_cluster_ssl_ciphers_ECDHE-RSA-AES128-GCM-SHA256 counter
envoy_cluster_ssl_ciphers{envoy_cluster_name="local_service", cipher_suite="ECDHE-RSA-AES128-GCM-SHA256"} 8

Risk Level: Low

Testing:

Updated unit/integration rests, + new unit tests.

There is a set of unit tests that can verify the sanitization of metrics
names when exported to Prometheus. These rely on the default set of
metrics for a fresh server which, incidentally, doesn't include metrics
with some prohibited characters. #2263 reports one of these cases, '-'.
An integration test falls in a similar case.

A fix for #2263 is trivial, but adding a regression test is not. The
root reason is that the affected method is private in AdminImpl, so the
only way of hitting the right code path is to add excessive complexity
as both unit/integration versions would require configuration that is
known to generate metrics with '-', etc.

This PR makes a preliminary refactor on admin.cc, moving all code
involved in formatting metrics for Prometheus to its own class where
methods can be made public (instead of polluting AdminImpl's API.) A
later improvement could also refactor the JSON formatter to a separate
class and introduce a strategy pattern for formatters. This requires a
shuffling a bit more code so I preferred to take a single step here.

Docs Changes:
N/A

Release Notes:
N/A

Fixes issue
#2263

There is a set of unit tests that can verify the sanitization of metrics
names when exported to Prometheus.  These rely on the default set of
metrics for a fresh server which, incidentally, doesn't include metrics
with some prohibited characters. #2263 reports one of these cases, '-'.
An integration test falls in a similar case.

A fix for #2263 is trivial, but adding a regression test is not.  The
root reason is that the affected method is private in AdminImpl, so the
only way of hitting the right code path is to add excessive complexity
as both unit/integration versions would require configuration that is
known to generate metrics with '-', etc.

This commit makes a preliminary refactor on admin.cc, moving all code
involved in formatting metrics for Prometheus to its own class where
methods can be made public (instead of polluting AdminImpl's API.) A
later improvement could also refactor the JSON formatter to a separate
class and introduce a strategy pattern for formatters.  This requires a
shuffling a bit more code so I preferred to take a single step here.

Tests are supplied, matching the current behaviour, but still incorrect
as per #2263.  A later commit will fix #2263, and update the test
accordingly.

Signed-off-by: Galo Navarro <anglorvaroa@gmail.com>
The Prometheus metrics formatter only sanitizes '.', but leaves other
non supported chars like '-'.  A case where this may appear is on cipher
suites as reported in #2263.

The change simply adds a second replace for '-' (rather than a regex to
avoid the perf hit, same reason for not matching on all [^a-Z0-9_]
chars.)

This replacement is only applied to metric and tag names, not to tag
values.

Signed-off-by: Galo Navarro <anglorvaroa@gmail.com>
Following the suggestion in the bug report, the cipher suite moves to a
tag from the metric name.

Signed-off-by: Galo Navarro <anglorvaroa@gmail.com>
@ggreenway
Copy link
Copy Markdown
Member

I haven't looked at the code yet, but there's already an issue for this: #2263

As described in the issue, I think the correct fix here is for the cipher suite to be a label.

@srvaroa
Copy link
Copy Markdown
Contributor Author

srvaroa commented Jan 3, 2018

@mattklein123 thanks in advance for the review.

Three doubts:

The second commit in the PR implies that metric names with '-' (not only in the cipher suite case) will get changed when exported to Prometheus. This should not break anything (no user would've been seeing those metrics as the name was illegal). Reviewing the contribution guidelines metric names don't seem to require updates on release notes / documentation, but I wanted to double check with you, since metric names are still user-facing. Let me know if you'd like to see any updates on docs.

On the third commit in the PR:

  1. @pschulz suggested moving the metric to "..._used_total", but I tried to minimize changes to the existing names and check with you beforehand. Happy to change to that or other names.
  2. The fix in the 2 first commits seems orthogonal to the third, it might make sense to split in 2 PRs but given that this implied two changes in metric names I preferred to do them at once. Let me know if you prefer to split.
envoy_cluster_ssl_ciphers_used_total{envoy_cluster_name="local_service", cipher_suite="ECDHE-RSA-AES128-GCM-SHA256"} 8

@ggreenway
Copy link
Copy Markdown
Member

Given that the previous names were illegal, I don't think there's any issue with changing the names.

@srvaroa
Copy link
Copy Markdown
Contributor Author

srvaroa commented Jan 3, 2018

@ggreenway thanks for the pointer, it's indeed linked to that issue (should I add the # to the PR title too?)

The 3rd commit moves the cipher suite to a label as suggested in the issue. However, sanitization of '-' seems to be also relevant, so I did this too in the first 2 commits (I noted in my previous comment the rationale for not splitting in 2 PRs, but happy to do that if preferred.)

@ggreenway
Copy link
Copy Markdown
Member

I think the proper link is to add "Fixes #2263" in the description. I think it's fine to leave this as 1 PR, but you can split it into two if you'd like.

Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks pretty good. Just 1 nit about naming and comments.

return rc;
}

std::string AdminImpl::sanitizePrometheusName(const std::string& name) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a more descriptive name than "strict" would be better. I think the distinction here is whether a stat name or a label name is being sanitized, right? If so, can you rename to something that implies that? Also, please comment on what the prometheus rules are, or link to the prometheus docs, so somebody reading the code can understand what's going on here.

Copy link
Copy Markdown
Contributor Author

@srvaroa srvaroa Jan 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I forgot a relevant doubt here.

If you check the code in ::formatTagsForPrometheus, it uses the sanitize function both for the tag name and the value. This seems odd to me, as we should not need to apply the sanitization to the tag value as well. In fact, the change required in this issue needs us not to apply the replace on '-'. I added the "isStrict" to distinguish between real names or values (this is what the 3rd commit contains.) In values, I respect the current replace of '.', but not apply it to '-'.

This was to minimize changes, but as I said, sanitizing values seems odd. I was wondering if rather than adding that if case, we should just not sanitize values at all. That is:

+    buf.push_back(fmt::format("{}=\"{}\"", sanitizePrometheusName(tag.name_), tag.value_)); 

I'm going to submit a commit addressing your current comments (better naming and comments) for now, but let me know if you agree on my suggestion above and I'll add it later.

cc @mattklein123

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it seems like we should only sanitize the value, and not the name. Typically, the value will come from configuration (although in the case of cipher suites that isn't true; they come from the ssl library). Whereas the tag name should be a compile-time constant in envoy somewhere.

Or am I misunderstanding?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments addressed @ ff21df3

I've added comments to the header file, cleaned up var / method names, and removed the strict hack as looking at Prometheus docs, it seems pointless. According to the data model, label values may take any unicode character, so there should be no need for sanitization.

You're right that the name may not need sanitization as it comes from configuration (in our case, the label name comes from well_known_names.h). I kept it because it's very easy that someone adds tag name unaware that some chars are not supported by Prometheus, and detecting this will be hard. Keeping the label name sanitization would solve that transparently. But let me know if you prefer to remove it.

Thanks again for the quick reviews & comments.

@ggreenway
Copy link
Copy Markdown
Member

Also, you need to fix_format

Signed-off-by: Galo Navarro <anglorvaroa@gmail.com>
Following review comments, this:

- Cleans up old method / variable names in the Prometheus exporter to
  make intent clearer.
- Adds docs, with references to Prometheus requirements on label and
  metric names.
- Removes sanitization from label values, as any unicode char is
  accepted (as per Prometheus docs.)

Signed-off-by: Galo Navarro <anglorvaroa@gmail.com>
Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Thanks for the fixes and cleanups!

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet! Thanks.

@mattklein123 mattklein123 merged commit 1670485 into envoyproxy:master Jan 4, 2018
JonathanO added a commit to JonathanO/envoy that referenced this pull request Jul 10, 2018
… issue envoyproxy#3019 (envoyproxy#3049)"

This reverts commit c2751df.

Prometheus label values are permitted to contain any Unicode char. The
original bug (envoyproxy#3019) was due to metric names not being sanitized, which
was resolved by envoyproxy#2303. Change envoyproxy#3049 was therefore unnecessary and
leads to an unexpected change in label values.

Signed-off-by: Jonathan Oddy <jonathan.oddy@transferwise.com>
htuch pushed a commit that referenced this pull request Jul 10, 2018
… issue #3019 (#3049)" (#3830)

This reverts commit c2751df.

The originally reported bug (#3019) was due to metric names not being sanitized, which I think was resolved by #2303.
Change #3049, which was intended to fix #3019, erroneously applied sanitization to label values instead. Prometheus label values are permitted to contain any Unicode char, so this change was unnecessary and leads to an unexpected change in e.g. cluster names.

This PR reverts the change, so that - and . will be preserved in label values.

Risk Level: low

Signed-off-by: Jonathan Oddy <jonathan.oddy@transferwise.com>
jpsim pushed a commit that referenced this pull request Nov 28, 2022
Risk Level: low
Testing: yep
Release Notes: inline
Fixes #2303

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit that referenced this pull request Nov 29, 2022
Risk Level: low
Testing: yep
Release Notes: inline
Fixes #2303

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: JP Simard <jp@jpsim.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants