Skip to content

Comments

doc: Add 'Secure Alertmanager cluster traffic' design document#1763

Merged
mxinden merged 1 commit intoprometheus:masterfrom
mxinden:design-secure-memberlist
Mar 5, 2019
Merged

doc: Add 'Secure Alertmanager cluster traffic' design document#1763
mxinden merged 1 commit intoprometheus:masterfrom
mxinden:design-secure-memberlist

Conversation

@mxinden
Copy link
Member

@mxinden mxinden commented Feb 21, 2019

As of today the communication between Alertmanager instances in a cluster is send in clear-text.

Instances in a cluster should communicate among each other in a secure fashion. Alertmanager should guarantee confidentiality, integrity and authenticity for each message touching the wire. While this would improve the security of single datacenter deployments, one could see this as a necessity for wide-area-network deployments.

This patch adds a design document to plan the goal above. A prove of concept implementation can be found here.

Ideas, comments, suggestions, ..., are very much appreciated!

Copy link
Contributor

@brian-brazil brian-brazil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds sane. How would this tie into the TLS stuff ongoing in the node exporter currently, and would/could it use the existing http client config libraries we have rather than having to reinvent all those settings?

- Replicate notification log

As of today the communication between Alertmanager instances in a cluster is
send in clear-text.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sent

TCP connections could be kept alive beyond a single message to reduce latency as
well as handshake overhead costs. While this is feasible in a 3-instance
Alertmanager cluster, the discussed custom implementation would need to limit
the amount of open connections for clusters with many instances (#connections =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is one per other AM really a problem?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Memberlist wants to have one connection as a reliable connection all to itself. Thereby we need at least two, one reliable TCP and one pseudo-best-effort connection unless we want to go down the road of multiplexing a single TCP connection.

@brian-brazil what maximum cluster size would you expect in the future?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle someone might run two per datacenter, and tens of datacenters isn't that unusual. Say 100?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. I will make sure to include that in the performance testing (in case we decide for this route).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "full sync" tcp request happens relatively infrequently, and send reliable is only used for especially large gossip messages (which is probably also relatively infrequent. it happens <<1% of the time at SC). Practically speaking, each instance would only maintain one connection to the other instances.

instead of the best-effort UDP connection to gossip large notification logs and
silences between instances. The reason is, that those packages would otherwise
exceed the [MTU](https://de.wikipedia.org/wiki/Maximum_Transmission_Unit) of
most UDP setups. Splitting packages is not supported by _Memberlist_ and was not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

packets

@mxinden mxinden force-pushed the design-secure-memberlist branch from 22c7569 to 174af3a Compare February 21, 2019 12:53

Instead of redirecting all best-effort traffic via the reliable channel as
proposed above, one could also secure the best-effort channel itself using UDP
and [DTLS](https://de.wikipedia.org/wiki/Datagram_Transport_Layer_Security) in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I thought you might want to practice your German a bit while reading the design document. What better to read than a network protocol specification document in German :P

Thanks @simonpasquier

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😂

@mxinden mxinden force-pushed the design-secure-memberlist branch 2 times, most recently from f78d1a8 to 2a661b6 Compare February 22, 2019 09:55
@stuartnelson3
Copy link
Contributor

haven't had a chance to look yet, but will very soon. sorry for the delay!

ideally done in an eventual consistent gossip fashion, given that Alertmanager
is supposed to scale beyond a 3-instance cluster and beyond local-area-network
deployments. With these requirements in mind, replacing _Memberlist_ with an
entirely self-build communication layer is a great undertaking.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self-built

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

encryption](https://godoc.org/github.com/hashicorp/memberlist#Keyring) via
AES-128, AES-192 or AES-256 ciphers. One can specify multiple keys for rolling
updates. Securing the cluster traffic via symmetric encryption would just
involve small configuration changes in the Alertmanager code base.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If both methods require generating a key, what is the downside of this method vs. the proposed method?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this would be a valid approach -- but we would need to

  • add amtool genkey command
  • specify in the doc that we use that library and that this form of encryption could change in the future

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it'd be a different way of doing auth than we're going to use elsewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we contribute our approach upstream?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If both methods require generating a key, what is the downside of this method vs. the proposed method?

@stuartnelson3 sorry for not covering that properly in the document:

  • asymmetric vs symmetric: TLS gives users more possible trust structures e.g. different certificate hierarchies, enabling users to exclude a specific (bad) alertmanager instance.
  • default vs one-off: Symmetric crypto is easier to setup in itself, but probably not the default security option for most users, hence a one-off solution. I would expect most operators to already have a public key infrastructure for tls in place (please correct me if I am wrong).
  • replay attacks: Given that memberlists symmetric crypto operates on unordered channel (UDP) I don't see how it can prevent replay attacks. TLS runs on top of TCP which would discard out of order messages of a replay attack.
  • consistency with Prometheus: As @brian-brazil said, the suggested method would keep Alertmanager consistent with the rest of the stack.

What are your thoughts @stuartnelson3?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we contribute our approach upstream?

Which one?

  • Symetric: It is already part of memberlists core.

  • TLSTransport: The TLSTransport logic implementing the Transport interface could be suggested to be added as an alternative to Memberlist's NetTransport. As TLSTransport does not alter any Memberlist code, I would say this is not critical.

Does that answer the question @roidelapluie?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My initial thought was wondering about the cost of developing and maintaining our own transport (and being consistent within the prometheus org) vs. using the keyring (and being inconsistent).

The points you list here seem like enough to warrant creating our own Transport.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replay attacks: Given that memberlists symmetric crypto operates on unordered channel (UDP) I don't see how it can prevent replay attacks. TLS runs on top of TCP which would discard out of order messages of a replay attack.

Re-reading the DTLS RFC, it does prevent replay attacks via an epoch and sequence number. I am sorry for the confusion.

@mxinden mxinden force-pushed the design-secure-memberlist branch from 2a661b6 to f1f9a91 Compare February 25, 2019 10:18
@stuartnelson3
Copy link
Contributor

I would expect most operators to already have a public key infrastructure for tls in place (please correct me if I am wrong)

add amtool genkey command

Is this something we want to add for users? How is prometheus handling this? I'm out of the loop on these efforts.

@brian-brazil
Copy link
Contributor

Is this something we want to add for users? How is prometheus handling this? I'm out of the loop on these efforts.

It'll be up to the user to deal with cert stuff.

Copy link
Member

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposal looks ok to me.

@roidelapluie
Copy link
Member

the genkey is for the"memberlist" approcach. Then we have no TLS, but a key.

@mxinden
Copy link
Member Author

mxinden commented Feb 28, 2019

Thanks everyone for the input. Any further comments? Otherwise I will merge tonight (CET).

Copy link
Contributor

@brian-brazil brian-brazil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Might also be worth talking about how this will work timeline-wise with the other TLS stuff in progress in the node exporter, which seems to be actively worked on.

Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
@mxinden mxinden force-pushed the design-secure-memberlist branch from f1f9a91 to d81b9a5 Compare March 1, 2019 15:13
@mxinden
Copy link
Member Author

mxinden commented Mar 1, 2019

Might also be worth talking about how this will work timeline-wise with the other TLS stuff in progress in the node exporter, which seems to be actively worked on.

I have looked into prometheus/node_exporter#1198. Thanks for the hint. Given that the effort in the node_exporter still seems to be in an early phase (correct me if I am wrong), I will try to follow along and give input in regards to compatibility with this proposal. Overall reusing the logic here sounds great to me. Having one consistent way of doing TLS across the project sounds great.

@brian-brazil
Copy link
Contributor

Having one consistent way of doing TLS across the project sounds great.

It's essential in my mind, security-related code is not something you want to be copy&pasting around.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants