doc: Add 'Secure Alertmanager cluster traffic' design document #1763
mxinden merged 1 commit into prometheus:master
brian-brazil
left a comment
Sounds sane. How would this tie into the TLS stuff ongoing in the node exporter currently, and would/could it use the existing http client config libraries we have rather than having to reinvent all those settings?
doc/design/secure-cluster-traffic.md
Outdated
| - Replicate notification log
|
| As of today the communication between Alertmanager instances in a cluster is
| sent in clear-text.
| TCP connections could be kept alive beyond a single message to reduce latency as
| well as handshake overhead costs. While this is feasible in a 3-instance
| Alertmanager cluster, the discussed custom implementation would need to limit
| the amount of open connections for clusters with many instances (#connections =
Is one per other AM really a problem?
Memberlist wants to have one connection as a reliable connection all to itself. Therefore we need at least two connections, one reliable TCP connection and one pseudo-best-effort connection, unless we want to go down the road of multiplexing a single TCP connection.
@brian-brazil what maximum cluster size would you expect in the future?
In principle someone might run two per datacenter, and tens of datacenters isn't that unusual. Say 100?
Alright. I will make sure to include that in the performance testing (in case we decide on this route).
The "full sync" TCP request happens relatively infrequently, and send reliable is only used for especially large gossip messages (which is probably also relatively infrequent; it happens <<1% of the time at SC). Practically speaking, each instance would only maintain one connection to each of the other instances.
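To put rough numbers on the cluster sizes mentioned in this thread, here is a small sketch that counts persistent connections. It assumes every pair of instances shares its connections (both ends use the same sockets) and that connections are kept open permanently; the function names are illustrative and not part of Alertmanager or Memberlist.

```go
package main

import "fmt"

// perInstanceConns returns how many connections a single instance keeps open
// when it holds connsPerPeer persistent connections to every other instance.
func perInstanceConns(instances, connsPerPeer int) int {
	if instances < 2 {
		return 0
	}
	return (instances - 1) * connsPerPeer
}

// clusterConns returns the number of distinct connections in the whole
// cluster, counting each connection between a pair of instances once.
func clusterConns(instances, connsPerPeer int) int {
	if instances < 2 {
		return 0
	}
	return instances * (instances - 1) / 2 * connsPerPeer
}

func main() {
	// 3 is the small cluster from the design document, 100 the upper
	// bound suggested in this thread. Two connections per peer models
	// one reliable plus one pseudo-best-effort connection.
	for _, n := range []int{3, 100} {
		fmt.Printf("instances=%d per-instance=%d cluster-wide=%d\n",
			n, perInstanceConns(n, 2), clusterConns(n, 2))
	}
}
```

With two connections per peer, a 100-instance cluster would mean 198 open connections per instance under these assumptions, which is the case the performance testing would need to cover.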
doc/design/secure-cluster-traffic.md
Outdated
| instead of the best-effort UDP connection to gossip large notification logs and
| silences between instances. The reason is that those packets would otherwise
| exceed the [MTU](https://de.wikipedia.org/wiki/Maximum_Transmission_Unit) of
| most UDP setups. Splitting packets is not supported by _Memberlist_ and was not
22c7569 to 174af3a
doc/design/secure-cluster-traffic.md
Outdated
| Instead of redirecting all best-effort traffic via the reliable channel as
| proposed above, one could also secure the best-effort channel itself using UDP
| and [DTLS](https://de.wikipedia.org/wiki/Datagram_Transport_Layer_Security) in
Oh, I thought you might want to practice your German a bit while reading the design document. What better to read than a network protocol specification document in German :P
Thanks @simonpasquier
f78d1a8 to 2a661b6
haven't had a chance to look yet, but will very soon. sorry for the delay!
doc/design/secure-cluster-traffic.md
Outdated
| ideally done in an eventually consistent gossip fashion, given that Alertmanager
| is supposed to scale beyond a 3-instance cluster and beyond local-area-network
| deployments. With these requirements in mind, replacing _Memberlist_ with an
| entirely self-built communication layer is a great undertaking.
| encryption](https://godoc.org/github.com/hashicorp/memberlist#Keyring) via
| AES-128, AES-192 or AES-256 ciphers. One can specify multiple keys for rolling
| updates. Securing the cluster traffic via symmetric encryption would just
| involve small configuration changes in the Alertmanager code base.
If both methods require generating a key, what is the downside of this method vs. the proposed method?
I think that this would be a valid approach -- but we would need to
- add amtool genkey command
- specify in the doc that we use that library and that this form of encryption could change in the future
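For illustration, a key generator along the lines of the suggested `amtool genkey` could look roughly like the sketch below. No such command exists in the source; `generateKey` and the hex output format are assumptions made for this example. The accepted sizes follow Memberlist's rule that a key of 16, 24 or 32 bytes selects AES-128, AES-192 or AES-256.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// generateKey returns size random bytes read from the operating system's
// cryptographically secure random source. Only the sizes Memberlist
// accepts for its keyring (16, 24 or 32 bytes) are allowed.
func generateKey(size int) ([]byte, error) {
	switch size {
	case 16, 24, 32:
		// valid AES key length
	default:
		return nil, fmt.Errorf("invalid key size %d: want 16, 24 or 32 bytes", size)
	}
	key := make([]byte, size)
	if _, err := rand.Read(key); err != nil {
		return nil, err
	}
	return key, nil
}

func main() {
	// Hypothetical "genkey" behaviour: print a fresh AES-256 key as hex
	// so it can be pasted into a configuration file.
	key, err := generateKey(32)
	if err != nil {
		panic(err)
	}
	fmt.Println(hex.EncodeToString(key))
}
```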
And it'd be a different way of doing auth than we're going to use elsewhere.
Could we contribute our approach upstream?
If both methods require generating a key, what is the downside of this method vs. the proposed method?
@stuartnelson3 sorry for not covering that properly in the document:
- asymmetric vs symmetric: TLS gives users more possible trust structures, e.g. different certificate hierarchies, enabling users to exclude a specific (bad) Alertmanager instance.
- default vs one-off: Symmetric crypto is easier to set up in itself, but probably not the default security option for most users, hence a one-off solution. I would expect most operators to already have a public key infrastructure for TLS in place (please correct me if I am wrong).
- replay attacks: Given that Memberlist's symmetric crypto operates on an unordered channel (UDP), I don't see how it can prevent replay attacks. TLS runs on top of TCP, which would discard the out-of-order messages of a replay attack.
- consistency with Prometheus: As @brian-brazil said, the suggested method would keep Alertmanager consistent with the rest of the stack.
What are your thoughts @stuartnelson3?
Could we contribute our approach upstream?
Which one?
- Symmetric: It is already part of Memberlist's core.
- TLSTransport: The `TLSTransport` logic implementing the `Transport` interface could be suggested as an alternative to Memberlist's `NetTransport`. As `TLSTransport` does not alter any Memberlist code, I would say this is not critical.
Does that answer the question @roidelapluie?
My initial thought was wondering about the cost of developing and maintaining our own transport (and being consistent within the prometheus org) vs. using the keyring (and being inconsistent).
The points you list here seem like enough to warrant creating our own Transport.
replay attacks: Given that Memberlist's symmetric crypto operates on an unordered channel (UDP), I don't see how it can prevent replay attacks. TLS runs on top of TCP, which would discard the out-of-order messages of a replay attack.
Re-reading the DTLS RFC, it does prevent replay attacks via an epoch and sequence number. I am sorry for the confusion.
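The sequence-number part of that protection can be sketched as a sliding window over recently seen record numbers. The code below is a simplified illustration, not DTLS itself: it ignores the epoch (a real implementation would keep separate state per epoch) and skips the integrity check that must succeed before a record is counted as seen. The type and method names are made up for this example.

```go
package main

import "fmt"

// replayWindow remembers which of the most recent 64 sequence numbers
// have already been accepted.
type replayWindow struct {
	started bool
	highest uint64 // highest sequence number accepted so far
	seen    uint64 // bit i set means sequence number highest-i was accepted
}

// accept reports whether seq is new and, if so, records it.
func (w *replayWindow) accept(seq uint64) bool {
	if !w.started {
		w.started, w.highest, w.seen = true, seq, 1
		return true
	}
	if seq > w.highest {
		// Slide the window forward so that bit 0 refers to seq.
		shift := seq - w.highest
		if shift >= 64 {
			w.seen = 0
		} else {
			w.seen <<= shift
		}
		w.seen |= 1
		w.highest = seq
		return true
	}
	offset := w.highest - seq
	if offset >= 64 {
		return false // older than the window, cannot be checked, so reject
	}
	if w.seen&(uint64(1)<<offset) != 0 {
		return false // already accepted once: a replay
	}
	w.seen |= uint64(1) << offset
	return true
}

func main() {
	w := &replayWindow{}
	// Record 2 arrives late but is still fresh; the second 3 is a replay.
	for _, seq := range []uint64{1, 3, 2, 3} {
		fmt.Printf("seq=%d accepted=%v\n", seq, w.accept(seq))
	}
}
```

Reordered records inside the window are still accepted, which is what allows this scheme to work on top of an unordered transport such as UDP.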
2a661b6 to f1f9a91
Is this something we want to add for users? How is Prometheus handling this? I'm out of the loop on these efforts.
It'll be up to the user to deal with cert stuff.
simonpasquier
left a comment
The proposal looks ok to me.
The genkey is for the "memberlist" approach. Then we have no TLS, but a key.
Thanks everyone for the input. Any further comments? Otherwise I will merge tonight (CET).
brian-brazil
left a comment
👍
Might also be worth talking about how this will work timeline-wise with the other TLS stuff in progress in the node exporter, which seems to be actively worked on.
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
f1f9a91 to d81b9a5
I have looked into prometheus/node_exporter#1198. Thanks for the hint. Given that the effort in the node_exporter still seems to be in an early phase (correct me if I am wrong), I will try to follow along and give input with regard to compatibility with this proposal. Overall, reusing the logic here sounds great to me. Having one consistent way of doing TLS across the project sounds great.
It's essential in my mind; security-related code is not something you want to be copy&pasting around.
As of today, the communication between Alertmanager instances in a cluster is sent in clear-text.
Instances in a cluster should communicate with each other in a secure fashion. Alertmanager should guarantee confidentiality, integrity and authenticity for each message touching the wire. While this would improve the security of single-datacenter deployments, one could see it as a necessity for wide-area-network deployments.
This patch adds a design document that plans how to reach the goal above. A proof-of-concept implementation can be found here.
Ideas, comments, suggestions, ..., are very much appreciated!