Refactor nflog configuration options to make it similar to Silences.#3220
The Notification Log is a similar component to Silences. They're the only two things that are shared between nodes when running in HA, and they both hold internal state that needs to be cleaned up on an interval. To simplify the code and make it more understandable (among other benefits, such as improved testability), I've refactored the notification log configuration and `run` to be similar to the silences. Signed-off-by: gotjosh <josue.abreu@gmail.com>
    // Log holds the notification log state for alerts that have been notified.
    type Log struct {
        clock clock.Clock
Most, if not all, of the diff in the tests has to do with the fact that we now mock the clock instead of injecting a `now` function. Technically, this could be done in a separate PR, but this PR seems simple enough to follow, so I decided to include it here.
    func WithNow(f func() time.Time) Option {
        return func(l *Log) error {
            l.now = f
            return nil
        }
    }
All of this plumbing, and the `now func() time.Time` on the manager struct, was only there to help with tests. It is all replaced by using `clock.Clock`.
    // If not nil, the last argument is an override for what to do as part of the maintenance - for advanced usage.
    func (l *Log) Maintenance(interval time.Duration, snapf string, stopc <-chan struct{}, override MaintenanceFunc) {
        if interval == 0 || stopc == nil {
            level.Error(l.logger).Log("msg", "interval or stop signal are missing - not running maintenance")
This is new. It kind of annoyed me that, on silences, we'd return from this function when the maintenance was misconfigured but fail silently. I don't think we should do this. Ideally we'd return an error, but I settled for a log line to keep the diff sane.
Works for me. I'd suggest that we check the validity of the maintenance interval in cmd/alertmanager/main.go in a later PR. As of today, a negative maintenance interval triggers a panic...
        return size, err
    }
    if size, err = s.Snapshot(f); err != nil {
        f.Close()
This seemed like a bug - we only close on the second return, but we also return here without closing the file descriptor. This case should be very rare but possible nonetheless.
nflog/nflog.go
    if o.SnapshotFile != "" {
        if r, err := os.Open(o.SnapshotFile); err != nil {
            if !os.IsNotExist(err) {
Should we maybe log at info-level that a previous log to load up didn't exist, so it'll create one? Or, maybe the inverse path, something like "Loading a previous snapshot..."
At a minimum, we should log it at debug level, since there are sort of two independent paths here. If for some reason a file can't be accessed, things could end up in a weird state, and there is no evidence trail left behind that it happened.
A debug log sounds good to me. Alertmanager would hit this code path when it starts from scratch.
    nflog.WithMaintenance(*maintenanceInterval, stopc, wg.Done, nil),
    nflog.WithMetrics(prometheus.DefaultRegisterer),
    nflog.WithLogger(log.With(logger, "component", "nflog")),
    notificationLogOpts := nflog.Options{

    SnapshotReader io.Reader
    SnapshotFile string
Let's add a doc-comment to each, saying only one of these fields should be set. That way, callers can see this in their editors directly - just makes it a little more convenient to use.
The top-level comment reads as:
// A snapshot file or reader from which the initial state is loaded.
// None or only one of them must be set.
In my editor it shows as:
Which indicates:
// None or only one of them must be set.
Are you thinking of something different? Happy to do it, but unsure of what this means for a different editor 🤔
Co-authored-by: Simon Pasquier <pasquier.simon@gmail.com> Signed-off-by: gotjosh <josue.abreu@gmail.com>
…rometheus#3220) * Refactor nflog configuration options to make it similar to Silences. The Notification Log is a similar component to Silences. They're the only two things that are shared between nodes when running in HA, and they both hold internal state that needs to be cleaned up on an interval. To simplify the code and make it more understandable (among other benefits, such as improved testability), I've refactored the notification log configuration and `run` to be similar to the silences.