doc: Add 'Secure Alertmanager cluster traffic' design document by mxinden · Pull Request #1763 · prometheus/alertmanager

mxinden · 2019-02-21T12:29:34Z

As of today the communication between Alertmanager instances in a cluster is send in clear-text.

Instances in a cluster should communicate among each other in a secure fashion. Alertmanager should guarantee confidentiality, integrity and authenticity for each message touching the wire. While this would improve the security of single datacenter deployments, one could see this as a necessity for wide-area-network deployments.

This patch adds a design document to plan the goal above. A prove of concept implementation can be found here.

Ideas, comments, suggestions, ..., are very much appreciated!

brian-brazil

Sounds sane. How would this tie into the TLS stuff ongoing in the node exporter currently, and would/could it use the existing http client config libraries we have rather than having to reinvent all those settings?

brian-brazil · 2019-02-21T12:39:52Z

doc/design/secure-cluster-traffic.md

+- Replicate notification log
+
+As of today the communication between Alertmanager instances in a cluster is
+send in clear-text.


brian-brazil · 2019-02-21T12:41:07Z

doc/design/secure-cluster-traffic.md

+TCP connections could be kept alive beyond a single message to reduce latency as
+well as handshake overhead costs. While this is feasible in a 3-instance
+Alertmanager cluster, the discussed custom implementation would need to limit
+the amount of open connections for clusters with many instances (#connections =


Is one per other AM really a problem?

Memberlist wants to have one connection as a reliable connection all to itself. Thereby we need at least two, one reliable TCP and one pseudo-best-effort connection unless we want to go down the road of multiplexing a single TCP connection.

@brian-brazil what maximum cluster size would you expect in the future?

In principle someone might run two per datacenter, and tens of datacenters isn't that unusual. Say 100?

Alright. I will make sure to include that in the performance testing (in case we decide for this route).

The "full sync" tcp request happens relatively infrequently, and send reliable is only used for especially large gossip messages (which is probably also relatively infrequent. it happens <<1% of the time at SC). Practically speaking, each instance would only maintain one connection to the other instances.

brian-brazil · 2019-02-21T12:41:42Z

doc/design/secure-cluster-traffic.md

+instead of the best-effort UDP connection to gossip large notification logs and
+silences between instances. The reason is, that those packages would otherwise
+exceed the [MTU](https://de.wikipedia.org/wiki/Maximum_Transmission_Unit) of
+most UDP setups. Splitting packages is not supported by _Memberlist_ and was not


simonpasquier · 2019-02-21T16:11:59Z

doc/design/secure-cluster-traffic.md

+
+Instead of redirecting all best-effort traffic via the reliable channel as
+proposed above, one could also secure the best-effort channel itself using UDP
+and [DTLS](https://de.wikipedia.org/wiki/Datagram_Transport_Layer_Security) in


https://en.wikipedia.org/wiki/Datagram_Transport_Layer_Security ;-)

Oh, I thought you might want to practice your German a bit while reading the design document. What better to read than a network protocol specification document in German :P

Thanks @simonpasquier

stuartnelson3 · 2019-02-22T10:00:31Z

haven't had a chance to look yet, but will very soon. sorry for the delay!

stuartnelson3 · 2019-02-24T13:43:20Z

doc/design/secure-cluster-traffic.md

+ideally done in an eventual consistent gossip fashion, given that Alertmanager
+is supposed to scale beyond a 3-instance cluster and beyond local-area-network
+deployments. With these requirements in mind, replacing _Memberlist_ with an
+entirely self-build communication layer is a great undertaking.


stuartnelson3 · 2019-02-24T13:52:55Z

doc/design/secure-cluster-traffic.md

+encryption](https://godoc.org/github.com/hashicorp/memberlist#Keyring) via
+AES-128, AES-192 or AES-256 ciphers. One can specify multiple keys for rolling
+updates. Securing the cluster traffic via symmetric encryption would just
+involve small configuration changes in the Alertmanager code base.


If both methods require generating a key, what is the downside of this method vs. the proposed method?

I think that this would be a valid approach -- but we would need to

add amtool genkey command

specify in the doc that we use that library and that this form of encryption could change in the future

And it'd be a different way of doing auth than we're going to use elsewhere.

Could we contribute our approach upstream?

If both methods require generating a key, what is the downside of this method vs. the proposed method?

@stuartnelson3 sorry for not covering that properly in the document:

asymmetric vs symmetric: TLS gives users more possible trust structures e.g. different certificate hierarchies, enabling users to exclude a specific (bad) alertmanager instance.

default vs one-off: Symmetric crypto is easier to setup in itself, but probably not the default security option for most users, hence a one-off solution. I would expect most operators to already have a public key infrastructure for tls in place (please correct me if I am wrong).

replay attacks: Given that memberlists symmetric crypto operates on unordered channel (UDP) I don't see how it can prevent replay attacks. TLS runs on top of TCP which would discard out of order messages of a replay attack.

consistency with Prometheus: As @brian-brazil said, the suggested method would keep Alertmanager consistent with the rest of the stack.

What are your thoughts @stuartnelson3?

Could we contribute our approach upstream?

Which one?

Symetric: It is already part of memberlists core.

TLSTransport: The TLSTransport logic implementing the Transport interface could be suggested to be added as an alternative to Memberlist's NetTransport. As TLSTransport does not alter any Memberlist code, I would say this is not critical.

Does that answer the question @roidelapluie?

My initial thought was wondering about the cost of developing and maintaining our own transport (and being consistent within the prometheus org) vs. using the keyring (and being inconsistent).

The points you list here seem like enough to warrant creating our own Transport.

replay attacks: Given that memberlists symmetric crypto operates on unordered channel (UDP) I don't see how it can prevent replay attacks. TLS runs on top of TCP which would discard out of order messages of a replay attack.

Re-reading the DTLS RFC, it does prevent replay attacks via an epoch and sequence number. I am sorry for the confusion.

stuartnelson3 · 2019-02-25T10:46:58Z

I would expect most operators to already have a public key infrastructure for tls in place (please correct me if I am wrong)

add amtool genkey command

Is this something we want to add for users? How is prometheus handling this? I'm out of the loop on these efforts.

brian-brazil · 2019-02-25T10:50:27Z

Is this something we want to add for users? How is prometheus handling this? I'm out of the loop on these efforts.

It'll be up to the user to deal with cert stuff.

simonpasquier

The proposal looks ok to me.

roidelapluie · 2019-02-25T11:46:40Z

the genkey is for the"memberlist" approcach. Then we have no TLS, but a key.

mxinden · 2019-02-28T09:55:54Z

Thanks everyone for the input. Any further comments? Otherwise I will merge tonight (CET).

brian-brazil

👍

Might also be worth talking about how this will work timeline-wise with the other TLS stuff in progress in the node exporter, which seems to be actively worked on.

doc/design/secure-cluster-traffic.md

Signed-off-by: Max Leonard Inden <IndenML@gmail.com>

mxinden · 2019-03-01T15:17:29Z

Might also be worth talking about how this will work timeline-wise with the other TLS stuff in progress in the node exporter, which seems to be actively worked on.

I have looked into prometheus/node_exporter#1198. Thanks for the hint. Given that the effort in the node_exporter still seems to be in an early phase (correct me if I am wrong), I will try to follow along and give input in regards to compatibility with this proposal. Overall reusing the logic here sounds great to me. Having one consistent way of doing TLS across the project sounds great.

brian-brazil · 2019-03-01T16:32:14Z

Having one consistent way of doing TLS across the project sounds great.

It's essential in my mind, security-related code is not something you want to be copy&pasting around.

mxinden added kind/design component/high availability labels Feb 21, 2019

brian-brazil reviewed Feb 21, 2019

View reviewed changes

mxinden force-pushed the design-secure-memberlist branch from 22c7569 to 174af3a Compare February 21, 2019 12:53

simonpasquier reviewed Feb 21, 2019

View reviewed changes

mxinden mentioned this pull request Feb 22, 2019

Securing the gossip protocol? #1322

Closed

mxinden force-pushed the design-secure-memberlist branch 2 times, most recently from f78d1a8 to 2a661b6 Compare February 22, 2019 09:55

stuartnelson3 reviewed Feb 24, 2019

View reviewed changes

mxinden force-pushed the design-secure-memberlist branch from 2a661b6 to f1f9a91 Compare February 25, 2019 10:18

simonpasquier approved these changes Feb 25, 2019

View reviewed changes

stuartnelson3 approved these changes Feb 25, 2019

View reviewed changes

brian-brazil reviewed Feb 28, 2019

View reviewed changes

doc/design/secure-cluster-traffic.md Show resolved Hide resolved

mxinden mentioned this pull request Mar 1, 2019

Adding TLS to node exporter. Server Auth only. prometheus/node_exporter#1198

Closed

doc: Add 'Secure Alertmanager cluster traffic' design document

d81b9a5

Signed-off-by: Max Leonard Inden <IndenML@gmail.com>

mxinden force-pushed the design-secure-memberlist branch from f1f9a91 to d81b9a5 Compare March 1, 2019 15:13

mxinden merged commit 0a2d108 into prometheus:master Mar 5, 2019

mxinden mentioned this pull request Mar 11, 2019

Adding TLS to node exporter - cleaner version prometheus/node_exporter#1277

Merged

Comments

Conversation

mxinden commented Feb 21, 2019

Uh oh!

brian-brazil left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stuartnelson3 commented Feb 22, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stuartnelson3 commented Feb 25, 2019

Uh oh!

brian-brazil commented Feb 25, 2019

Uh oh!

simonpasquier left a comment

Choose a reason for hiding this comment

Uh oh!

roidelapluie commented Feb 25, 2019

Uh oh!

mxinden commented Feb 28, 2019

Uh oh!

brian-brazil left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mxinden commented Mar 1, 2019

Uh oh!

brian-brazil commented Mar 1, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants