Secure cluster traffic via mutual TLS #2237

Merged
csmarchbanks merged 13 commits into prometheus:main from hooten:tls
Aug 9, 2021

Conversation

@hooten (Contributor) commented Apr 20, 2020

Co-authored-by: Sharad Gaur sharadgaur@gmail.com
Signed-off-by: Dustin Hooten dustinhooten@gmail.com

This pull request makes it possible to use mutual TLS for cluster communications.

Passing the path to a TLS configuration file via the command-line flag --cluster.tls-config enables TLS for inter-peer gossip.

Mutual TLS is achieved by configuring Memberlist to use a TLS implementation of the memberlist.Transport interface. This approach has been submitted in a design doc and implemented as a proof-of-concept in #1819. This pull request continues that work.

Closes #1322

Update (December 30, 2020): the TLS server config now uses common code from prometheus/exporter-toolkit.
Update (January 2021): the TLS config now also allows client configuration, using code from prometheus/common.
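For illustration only, a cluster TLS config file along these lines should be in the ballpark — the field names assume the exporter-toolkit server-config and prometheus/common client-config schemas mentioned above, and the file paths are placeholders:

```yaml
tls_server_config:
  cert_file: node.crt
  key_file: node.key
  client_ca_file: ca.crt
  client_auth_type: "RequireAndVerifyClientCert"

tls_client_config:
  cert_file: node.crt
  key_file: node.key
  ca_file: ca.crt
```

Requiring and verifying client certificates on the server side is what makes the TLS mutual.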

@hooten hooten force-pushed the tls branch 4 times, most recently from b9f1ea4 to 36e6326 Compare April 20, 2020 17:48
@brian-brazil (Contributor) commented

This PR should not be considered for merging until the TLS code is in prometheus/common; there are no plans to ever have more than one copy of that code.

@hooten (Contributor, Author) commented Apr 20, 2020

Happy to have the rest of it reviewed. We can clean up the TLS-Config code before merging.

@mxinden (Member) left a comment

Great that this is happening!

I added two comments inline.

In my proof-of-concept I extracted the memberlist specific logic to a separate project for other memberlist users to reuse. What do you think about doing the same here? Is the additional complexity worth the effort?


// writePacket writes all the bytes in one operation so no concurrent write happens in between.
// It prefixes the connection type, the from address and the message length.
func writePacket(conn net.Conn, fromAddr string, b []byte) error {
Member

To future-proof this protocol what do you think of sending a version number down the wire?

Protobuf would do length delimiting and versioning. Do you think that would be overkill?

Contributor Author

Great idea!

Contributor Author

We made a change to use protobuf for this. Any thoughts? Are there ways we could improve it? Thanks!

// writes to it, and closes it. It also returns a time stamp of when
// the packet was written.
func (t *TLSTransport) WriteTo(b []byte, addr string) (time.Time, error) {
dialer := &net.Dialer{Timeout: DefaultTcpTimeout}
Member

Do I understand correctly, that this is opening up a new TCP connection for each packet to send? What do you think of something along the lines of a connection pool?

@sharadgaur (Contributor) commented Apr 29, 2020

It adds additional complexity, as we would need to manage the lifecycle of the connections, and memberlist does not provide a good way to handle that. I would not recommend using a connection pool in this case.
In our testing we have not seen any issues yet. If you like, we can run additional load tests.

Contributor

What sort of network latency did the testing have?

@sharadgaur (Contributor) commented May 1, 2020

An individual request takes less than 3 milliseconds. Here are the benchmark results:
goos: darwin
goarch: amd64
pkg: github.com/prometheus/alertmanager/cluster
BenchmarkWriteTo-12 1000000000 0.00243 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/prometheus/alertmanager/cluster 0.111s
Success: Benchmarks passed.

Contributor

The question isn't how long things take over localhost; it's how many round trips may be required over a 200ms+ connection.

Contributor

Oh, got it. I will run a test against a remotely deployed Alertmanager next week. If we see any issues, we will implement a connection pool.
Thank you.

Contributor Author

We've added a connection pool :)

@csmarchbanks (Member) commented

In my proof-of-concept I extracted the memberlist specific logic to a separate project for other memberlist users to reuse. What do you think about doing the same here? Is the additional complexity worth the effort?

We already moved tsdb back inside of Prometheus to avoid dealing with version upgrades and such across projects. It is up to the maintainers of Alertmanager, but I would keep the lessons of tsdb in mind.

@sharadgaur sharadgaur force-pushed the tls branch 3 times, most recently from b249731 to edac61c Compare May 5, 2020 16:44
@hooten hooten force-pushed the tls branch 2 times, most recently from d85cd6e to b79abcc Compare May 5, 2020 17:15
@sharadgaur sharadgaur force-pushed the tls branch 3 times, most recently from 8458f49 to e79d7c4 Compare May 6, 2020 01:55
@stale stale bot added the stale label Jul 5, 2020
@stale stale bot removed the stale label Dec 30, 2020
@hooten hooten force-pushed the tls branch 8 times, most recently from d3fcaa4 to 86140f9 Compare January 1, 2021 18:52
@hooten (Contributor, Author) commented Jan 12, 2021

I've made the requested updates. I'd appreciate another review @brian-brazil @mxinden @csmarchbanks. Thanks!

@hooten hooten requested a review from brian-brazil January 12, 2021 19:16
if err != nil {
return nil, err
}
pool.cache.Add(key, conn)
Contributor Author

@csmarchbanks I updated this to use an LRU cache. I removed the mutex from this file because the cache provides locking. However, I'm wondering if I still need one between pool.cache.Get and pool.cache.Add. Thoughts?

Member

I think yes, as two concurrent operations could each dial a TLS connection and then add a connection for the same key.

Member

You might be able to use PeekOrAdd to avoid a mutex, but it would also have a bit of extra complexity. Basically:

  1. Do the get
  2. If not found, dial a TLS connection
  3. Use PeekOrAdd
  4. If found, use the "old" connection returned by PeekOrAdd; if added, use the newly created connection.

Whichever way looks cleaner is fine by me.

Contributor Author

I tried both ways. In the end, the mutex ended up looking cleaner.

PeekOrAdd would have been nice if I didn't always need to check whether the connection is alive when I get it back from the cache.
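A minimal sketch of the mutex-guarded get-or-dial pattern settled on above. The names (connPool, getConnection, dialFn) are hypothetical, not the PR's actual code, and a plain map with string values stands in for the LRU cache of net.Conn so the locking pattern is visible:

```go
package main

import (
	"fmt"
	"sync"
)

// connPool is a hypothetical sketch of a mutex-guarded connection cache.
// The real PR adds an LRU cache and liveness checks on cached connections.
type connPool struct {
	mu     sync.Mutex
	conns  map[string]string // addr -> connection (string stands in for net.Conn)
	dialFn func(addr string) string
}

// getConnection holds the lock across the get-check-dial-add sequence, so
// two goroutines can never both dial for the same key.
func (p *connPool) getConnection(addr string) string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if c, ok := p.conns[addr]; ok {
		return c // cached (the real code would also verify it is still alive)
	}
	c := p.dialFn(addr)
	p.conns[addr] = c
	return c
}

// demo hammers one address from several goroutines and reports how many
// dials actually happened.
func demo() int {
	dials := 0
	p := &connPool{
		conns: map[string]string{},
		dialFn: func(addr string) string {
			dials++ // safe: dialFn runs while p.mu is held
			return "conn-to-" + addr
		},
	}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			p.getConnection("10.0.0.5:9094")
		}()
	}
	wg.Wait()
	return dials
}

func main() {
	fmt.Println(demo()) // prints 1: the lock makes the whole get-or-dial atomic
}
```

The trade-off against PeekOrAdd is that this serializes dials for distinct keys too, but it keeps the liveness check and the cache update in one critical section.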

hooten and others added 12 commits August 5, 2021 16:50
Co-authored-by: Sharad Gaur <sharadgaur@gmail.com>
Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>
@csmarchbanks (Member) left a comment

Re-ran the tests and they pass now, looks like a flaky acceptance test. 👍 from me, I will wait for Monday to see if any comments/concerns come through and will then plan to merge this!

@csmarchbanks csmarchbanks merged commit ff85bec into prometheus:main Aug 9, 2021
@hooten hooten deleted the tls branch August 10, 2021 01:41
@hooten (Contributor, Author) commented Aug 10, 2021

🎉 Thanks @csmarchbanks @gotjosh @beorn7 @mxinden @brian-brazil @sharadgaur !

@csmarchbanks (Member) commented

Thanks @hooten for all your work and commitment on this PR!

@markmsmith commented Aug 12, 2021

+100, this is awesome, thank you!
Now that this has landed, is there a release planned soon, or are there any other blocking issues?

@pracucci mentioned this pull request Sep 15, 2021
nekketsuuu pushed a commit to nekketsuuu/alertmanager that referenced this pull request Oct 1, 2021
* Add TLS option to gossip cluster

Co-authored-by: Sharad Gaur <sharadgaur@gmail.com>
Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* generate new certs that expire in 100 years

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* Fix tls_connection attributes

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* Improve error message

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* Fix tls client config docs

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* Add capacity arg to message buffer

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* fix formatting

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* Update version; add version validation

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* use lru cache for connection pool

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* lock reading from the connection

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* when extracting net.Conn from tlsConn, lock and throw away wrapper

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* Add mutex to connection pool to protect cache

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

* fix linting

Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>

Co-authored-by: Sharad Gaur <sharadgaur@gmail.com>
@osipov-vadim commented Mar 4, 2022

Hello! Thanks a lot for adding this feature, it seems like a great addition!

I have a few questions:

  1. What format should the TLS config file be in? I tried a few formats, and none of them seem to be the right one.
  2. Related to the question above: when I pass the --cluster.tls-config flag to AM, it fails to start. The error is: level=error ts=2022-03-04T21:21:25.340Z caller=coordinator.go:118 component=configuration msg="Loading configuration file failed" file=alertmanager.yml err="open alertmanager.yml: no such file or directory" where file=alertmanager.yml is not the path specified on the command line. Could that be because the config file is wrong?
  3. Is this feature enabled by default in Alertmanager, or do I need to download additional libraries? It's not entirely clear.

I would highly appreciate any advice.

@Scrin commented Mar 4, 2022

  1. An example TLS config file can be found here
  2. The main config file (typically alertmanager.yml, provided via --config.file) is different from the TLS config file; make sure you are not mixing them up or missing the main config file. Something like:
    /bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --cluster.tls-config=/etc/alertmanager/tls.yml (plus any other CLI flags) should work
  3. I believe it is enabled by default in AM if you provide a valid configuration; at least on the Docker image, the config is all you need

@osipov-vadim commented Mar 7, 2022

@Scrin Thanks a lot for the reply!
Yeah, it seems I totally missed that document, even though I was googling and searching for a TLS config file example. Exactly what I needed!

I am not using a ready-made Docker image; I'm building my own. At this point I get the error alertmanager: error: unknown long flag '--cluster.tls-config', try --help, and --help does not list the --cluster.tls-config flag at all. Alertmanager version 0.23.0.

Maybe I've missed a step in the installation process?

Update:
I also tried passing --cluster.tls-config to this image:
docker run --name alertmanager -d -p 127.0.0.1:9093:9093 quay.io/prometheus/alertmanager
and I seem to get exactly the same thing: AM doesn't know about the --cluster.tls-config flag.

@Scrin commented Mar 8, 2022

0.23.0 was released before this was merged, so the feature isn't available in that version, and the next version hasn't been released yet. If you build AM from the main branch (instead of the v0.23.0 tag) you will get this feature (Docker users can use the quay.io/prometheus/alertmanager:main or prom/alertmanager:main image for that)

}{
{bindAddr: localhost, bindPort: 9094, inputIp: "10.0.0.5", inputPort: 54231, expectedIp: "10.0.0.5", expectedPort: 54231},
{bindAddr: localhost, bindPort: 9093, inputIp: "invalid", inputPort: 54231, expectedError: "failed to parse advertise address \"invalid\""},
{bindAddr: "0.0.0.0", bindPort: 0, inputIp: "", inputPort: 0, expectedIp: "random"},
Contributor

Such a test is not very packager-friendly. Most distros will run unit tests in a clean-room environment when packaging, which often means a host with only a loopback interface (with 127.0.0.1 and usually [::1] configured) and no default route.

On Debian buildbots, and presumably other distros with similar clean room package building policy, this fails with:

=== RUN   TestFinalAdvertiseAddr
    tls_transport_test.go:89: 
        	Error Trace:	/<<PKGBUILDDIR>>/build/src/github.com/prometheus/alertmanager/cluster/tls_transport_test.go:89
        	Error:      	Expected nil, but got: &errors.errorString{s:"no private IP address found, and explicit IP not provided"}
        	Test:       	TestFinalAdvertiseAddr
--- FAIL: TestFinalAdvertiseAddr (0.00s)

This is due to the bindAddr being "0.0.0.0" and the inputIp being empty.

I see that TestClusterJoinAndReconnect in cluster/cluster_test.go at least checks first whether a private IP exists, and skips the test if none is found. Perhaps it makes sense to omit this particular test case in such a scenario also.

Member

I see that TestClusterJoinAndReconnect in cluster/cluster_test.go at least checks first whether a private IP exists, and skips the test if none is found. Perhaps it makes sense to omit this particular test case in such a scenario also.

I agree and the skip in cluster_test.go exists exactly for the same constraint (#1445).
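A guard along these lines could implement the skip. This is a standard-library sketch (hasPrivateIP is a hypothetical helper; the existing skip in cluster_test.go may resolve the address differently), with the skip itself left to the caller via t.Skip:

```go
package main

import (
	"fmt"
	"net"
)

// hasPrivateIP reports whether any non-loopback interface carries a
// private (RFC 1918 / RFC 4193) address. A test could call this and
// t.Skip() when it returns false, e.g. on a loopback-only buildbot.
func hasPrivateIP() (bool, error) {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return false, err
	}
	for _, a := range addrs {
		ipNet, ok := a.(*net.IPNet)
		if !ok || ipNet.IP.IsLoopback() {
			continue
		}
		if ipNet.IP.IsPrivate() {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	ok, err := hasPrivateIP()
	fmt.Println(ok, err)
}
```

In the failing test case, the guard would wrap only the bindAddr "0.0.0.0" entry, leaving the explicit-IP cases to run everywhere.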



Development

Successfully merging this pull request may close these issues.

Securing the gossip protocol?