Secure cluster traffic via mutual TLS #2237
Conversation
force-pushed from b9f1ea4 to 36e6326
This PR should not be considered for merging until the TLS code is in common; it is not planned to ever have more than one copy of that code.
Happy to have the rest of it reviewed. We can clean up the TLS config code before merging.
mxinden left a comment:
Great that this is happening!
I added two comments inline.
In my proof-of-concept I extracted the memberlist-specific logic to a separate project for other memberlist users to reuse. What do you think about doing the same here? Is the additional complexity worth the effort?
cluster/tls_connection.go
// writePacket writes all the bytes in one operation so no concurrent write happens in between.
// It prefixes the connection type, the from address and the message length.
func writePacket(conn net.Conn, fromAddr string, b []byte) error {
To future-proof this protocol, what do you think of sending a version number down the wire?
Protobuf would do length delimiting and versioning. Do you think that would be overkill?
We made a change to use protobuf for this. Any thoughts? Are there ways we could improve it? Thanks!
cluster/tls_transport.go
// writes to it, and closes it. It also returns a time stamp of when
// the packet was written.
func (t *TLSTransport) WriteTo(b []byte, addr string) (time.Time, error) {
	dialer := &net.Dialer{Timeout: DefaultTcpTimeout}
Do I understand correctly that this is opening a new TCP connection for each packet sent? What do you think of something along the lines of a connection pool?
It adds additional complexity, as we would need to manage the lifecycle of the connection, and memberlist does not provide any good way to handle that. I would not recommend using a connection pool in this case.
In our testing, we have not seen any issues yet. If you like, we can run additional load tests.
What sort of network latency did the testing have?
An individual request takes less than 3 milliseconds. Here are the benchmark results:
goos: darwin
goarch: amd64
pkg: github.com/prometheus/alertmanager/cluster
BenchmarkWriteTo-12 1000000000 0.00243 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/prometheus/alertmanager/cluster 0.111s
Success: Benchmarks passed.
The question isn't about how long things take over localhost, it's how many round trips may be required over a 200ms+ connection.
Oh, got it. I will run a test against a remotely deployed Alertmanager next week. If we see any issues we will implement a connection pool.
Thank you.
We've added a connection pool :)
We already moved tsdb back inside of Prometheus to avoid dealing with version upgrades and such across projects. It is up to the maintainers of Alertmanager, but I would keep the lessons of tsdb in mind.
force-pushed from b249731 to edac61c
force-pushed from d85cd6e to b79abcc
force-pushed from 8458f49 to e79d7c4
force-pushed from d3fcaa4 to 86140f9
I've made the requested updates. I'd appreciate another review @brian-brazil @mxinden @csmarchbanks. Thanks!
if err != nil {
	return nil, err
}
pool.cache.Add(key, conn)
@csmarchbanks I updated this to use an LRU cache. I removed the mutex from this file because the cache provides locking. However, I'm wondering if I still need it between the pool.cache.Get and the pool.cache.Add. Thoughts?
I think yes, as two concurrent operations could each dial a TLS connection and then add a connection for the same key.
You might be able to use PeekOrAdd to avoid a mutex, but it would also have a bit of extra complexity. Basically:
- Do the get
- If not found, dial a TLS connection
- Use PeekOrAdd
- If found, use the "old" connection returned by PeekOrAdd; if added, use the created connection.
Whichever way looks cleaner is fine by me.
I tried both ways. In the end, the mutex ended up looking cleaner.
PeekOrAdd would have been nice if I didn't always need to check whether the connection is alive when I get it back from the cache.
Co-authored-by: Sharad Gaur <sharadgaur@gmail.com> Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>
Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>
csmarchbanks left a comment:
Re-ran the tests and they pass now, looks like a flaky acceptance test. 👍 from me, I will wait for Monday to see if any comments/concerns come through and will then plan to merge this!
🎉 Thanks @csmarchbanks @gotjosh @beorn7 @mxinden @brian-brazil @sharadgaur!
Thanks @hooten for all your work and commitment on this PR!
+100, this is awesome, thank you!
* Add TLS option to gossip cluster
* generate new certs that expire in 100 years
* Fix tls_connection attributes
* Improve error message
* Fix tls client config docs
* Add capacity arg to message buffer
* fix formatting
* Update version; add version validation
* use lru cache for connection pool
* lock reading from the connection
* when extracting net.Conn from tlsConn, lock and throw away wrapper
* Add mutex to connection pool to protect cache
* fix linting

Co-authored-by: Sharad Gaur <sharadgaur@gmail.com>
Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>
Hello! Thanks a lot for adding this feature, it seems like a very great thing! I have a few questions;
I would highly appreciate advice.
@Scrin Thanks a lot for the reply! I am not using a ready-made Docker image, I'm building my own. At this point I get an error. Maybe I've missed a step in the installation process? Update:
0.23.0 was released before this was merged, so it's not available in that version yet, and the next version doesn't seem to be released yet. If you build AM from the main branch (instead of the v0.23.0 tag) you should get this feature (Docker users can use
}{
	{bindAddr: localhost, bindPort: 9094, inputIp: "10.0.0.5", inputPort: 54231, expectedIp: "10.0.0.5", expectedPort: 54231},
	{bindAddr: localhost, bindPort: 9093, inputIp: "invalid", inputPort: 54231, expectedError: "failed to parse advertise address \"invalid\""},
	{bindAddr: "0.0.0.0", bindPort: 0, inputIp: "", inputPort: 0, expectedIp: "random"},
Such a test is not very packager-friendly. Most distros will run unit tests in clean room environment when packaging, which can often mean a host with only a loopback interface (with 127.0.0.1 and usually [::1] configured), and no default route.
On Debian buildbots, and presumably other distros with similar clean room package building policy, this fails with:
=== RUN TestFinalAdvertiseAddr
tls_transport_test.go:89:
Error Trace: /<<PKGBUILDDIR>>/build/src/github.com/prometheus/alertmanager/cluster/tls_transport_test.go:89
Error: Expected nil, but got: &errors.errorString{s:"no private IP address found, and explicit IP not provided"}
Test: TestFinalAdvertiseAddr
--- FAIL: TestFinalAdvertiseAddr (0.00s)
This is due to the bindAddr being "0.0.0.0" and the inputIp being empty.
I see that TestClusterJoinAndReconnect in cluster/cluster_test.go at least checks first whether a private IP exists, and skips the test if none is found. Perhaps it makes sense to omit this particular test case in such a scenario also.
> I see that TestClusterJoinAndReconnect in cluster/cluster_test.go at least checks first whether a private IP exists, and skips the test if none is found. Perhaps it makes sense to omit this particular test case in such a scenario also.
I agree and the skip in cluster_test.go exists exactly for the same constraint (#1445).
Co-authored-by: Sharad Gaur <sharadgaur@gmail.com>
Signed-off-by: Dustin Hooten <dustinhooten@gmail.com>
This pull request makes it possible to use mutual TLS for cluster communications. By adding the path to a TLS configuration file using the command line flag --cluster.tls-config, TLS will be used for inter-peer gossip.

Mutual TLS is achieved by configuring memberlist to use a TLS implementation of the memberlist.Transport interface. This approach was submitted in a design doc and implemented as a proof-of-concept in #1819. This pull request continues that work.

Closes #1322

Update, December 30, 2020: TLS server config uses common code from prometheus/exporter-toolkit.
Update, January 2021: TLS config also allows client config using code from prometheus/common.
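As a rough illustration, a file passed via --cluster.tls-config might look like the sketch below. Field names follow the exporter-toolkit (server side) and prometheus/common (client side) formats mentioned in the updates above; the paths are placeholders, and the exact schema should be checked against the released Alertmanager documentation:

```yaml
# Hypothetical cluster TLS configuration file for --cluster.tls-config.
tls_server_config:
  cert_file: /etc/alertmanager/server.crt
  key_file: /etc/alertmanager/server.key
  # Require peers to present a certificate signed by our CA (mutual TLS).
  client_auth_type: RequireAndVerifyClientCert
  client_ca_file: /etc/alertmanager/ca.crt
tls_client_config:
  cert_file: /etc/alertmanager/client.crt
  key_file: /etc/alertmanager/client.key
  ca_file: /etc/alertmanager/ca.crt
```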