Refactor nflog configuration options to make it similar to Silences. by gotjosh · Pull Request #3220 · prometheus/alertmanager

gotjosh · 2023-01-18T19:48:58Z

The Notification Log is a similar component to Silences. They're the only two things that are shared between nodes when running in HA and they both hold some sort of internal state that needs to be cleaned up on an interval.

To simplify the code and make it easier to understand (with other benefits, such as improved testability), I've refactored the notification log configuration and `run` to be similar to the silences.

Signed-off-by: gotjosh <josue.abreu@gmail.com>

gotjosh · 2023-01-18T20:17:28Z

nflog/nflog.go


+// Log holds the notification log state for alerts that have been notified.
 type Log struct {
+	clock clock.Clock


Most, if not all, of the diff in the tests has to do with the fact that we now mock the clock instead of injecting a now function. Technically, this could be done in a separate PR, but this PR seems simple enough to follow, so I decided to include it here.

gotjosh · 2023-01-18T20:18:22Z

nflog/nflog.go

-func WithNow(f func() time.Time) Option {
-	return func(l *Log) error {
-		l.now = f
-		return nil
-	}
-}


All of this plumbing, and the now func() time.Time on the manager struct, existed only to help with tests. It is all replaced by using clock.Clock.
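To illustrate why a clock abstraction makes an option like WithNow unnecessary, here is a minimal, stdlib-only sketch. The Clock interface, fakeClock, and entryLog below are hypothetical stand-ins invented for this example; they are not the types used by nflog or by the clock package referenced in the PR. The sketch only shows the general idea: production code reads the time through an interface, and a test swaps in a clock it can advance by hand.

```go
package main

import (
	"fmt"
	"time"
)

// Clock is a hypothetical, minimal stand-in for a clock abstraction.
type Clock interface {
	Now() time.Time
}

// realClock reads the wall clock; production code would use this.
type realClock struct{}

func (realClock) Now() time.Time { return time.Now() }

// fakeClock is a manually advanced clock for tests.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

// entryLog is a toy log whose entries expire after a retention period.
type entryLog struct {
	clock     Clock
	retention time.Duration
	expiresAt map[string]time.Time
}

func newEntryLog(c Clock, retention time.Duration) *entryLog {
	return &entryLog{clock: c, retention: retention, expiresAt: map[string]time.Time{}}
}

// Record stores key with an expiry derived from the injected clock.
func (l *entryLog) Record(key string) {
	l.expiresAt[key] = l.clock.Now().Add(l.retention)
}

// GC removes expired entries and returns how many were dropped.
func (l *entryLog) GC() int {
	now := l.clock.Now()
	dropped := 0
	for k, exp := range l.expiresAt {
		if !exp.After(now) {
			delete(l.expiresAt, k)
			dropped++
		}
	}
	return dropped
}

// demo records one entry with a one-hour retention and runs GC twice:
// once before the retention has elapsed and once after it.
func demo() (droppedBefore, droppedAfter int) {
	fc := &fakeClock{t: time.Unix(0, 0)}
	l := newEntryLog(fc, time.Hour)
	l.Record("alert-1")

	fc.Advance(30 * time.Minute)
	droppedBefore = l.GC()

	fc.Advance(time.Hour)
	droppedAfter = l.GC()
	return droppedBefore, droppedAfter
}

func main() {
	before, after := demo()
	fmt.Println("dropped before retention:", before)
	fmt.Println("dropped after retention:", after)
}
```

With this shape, the test controls time by calling Advance on the fake clock, so no test-only option needs to exist on the production type.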

gotjosh · 2023-01-18T20:19:52Z

nflog/nflog.go

+// If not nil, the last argument is an override for what to do as part of the maintenance - for advanced usage.
+func (l *Log) Maintenance(interval time.Duration, snapf string, stopc <-chan struct{}, override MaintenanceFunc) {
+	if interval == 0 || stopc == nil {
+		level.Error(l.logger).Log("msg", "interval or stop signal are missing - not running maintenance")


This is new. It kind of annoyed me that, in silences, we'd return from this function when the maintenance was misconfigured but fail silently. I don't think we should do that. Ideally we'd return an error, but I settled for a log line to keep the diff sane.

gotjosh · 2023-01-18T20:21:09Z

silence/silence.go

 			return size, err
 		}
 		if size, err = s.Snapshot(f); err != nil {
+			f.Close()


This seemed like a bug: we only close on the second return, but we also return here without closing the file descriptor. This case should be very rare, but it is possible nonetheless.
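The pattern behind this fix can be sketched in isolation. The snapshotTo helper below is hypothetical and is not the function from silence/silence.go; it only assumes a snapshot callback that writes to a file and may fail. The point is that every return path closes the file, including the early return on a snapshot error.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
)

// snapshotTo writes a snapshot to f and closes f on every return path.
// It is a hypothetical helper written for this example.
func snapshotTo(f *os.File, snapshot func(io.Writer) (int64, error)) (int64, error) {
	size, err := snapshot(f)
	if err != nil {
		// Without this Close, the error path would leak the file descriptor.
		f.Close()
		return size, err
	}
	// On success, surface any error reported by Close.
	return size, f.Close()
}

// closedAfterFailure reports whether the file is closed once a failing
// snapshot has gone through snapshotTo.
func closedAfterFailure() (bool, error) {
	f, err := os.CreateTemp("", "snapshot-*")
	if err != nil {
		return false, err
	}
	defer os.Remove(f.Name())

	failing := func(io.Writer) (int64, error) {
		return 0, errors.New("snapshot failed")
	}
	if _, err := snapshotTo(f, failing); err == nil {
		return false, errors.New("expected a snapshot error")
	}

	// Writing to a closed *os.File fails with an error wrapping os.ErrClosed.
	_, werr := f.Write([]byte("x"))
	return errors.Is(werr, os.ErrClosed), nil
}

func main() {
	closed, err := closedAfterFailure()
	fmt.Println("closed after failure:", closed, "error:", err)
}
```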

alexweav · 2023-01-19T14:42:38Z

nflog/nflog.go

+
+	if o.SnapshotFile != "" {
+		if r, err := os.Open(o.SnapshotFile); err != nil {
+			if !os.IsNotExist(err) {


Should we maybe log at info level that there was no previous log to load, so a new one will be created? Or maybe the inverse path, something like "Loading a previous snapshot..."

At a minimum, we should log it at debug level, since there are sort of two independent paths here. If for some reason a file can't be accessed, things could end up in a weird state, and there is no evidence trail left behind that it happened.

A debug log sounds good to me. Alertmanager would hit this code path when it starts from scratch.
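A rough sketch of the two paths discussed here, using only the standard library. The loadSnapshot function and the debugf callback are hypothetical names for this example, not the nflog API. The diff uses os.IsNotExist; the sketch uses errors.Is with fs.ErrNotExist, which serves the same purpose. A missing file is treated as a fresh start and leaves a debug line behind, while any other error is returned to the caller.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// loadSnapshot reads the snapshot at path. A missing file is not an error:
// it is logged through debugf and reported as "not loaded".
func loadSnapshot(path string, debugf func(string, ...interface{})) ([]byte, bool, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		if !errors.Is(err, fs.ErrNotExist) {
			// Permission problems and similar failures are surfaced.
			return nil, false, err
		}
		debugf("snapshot file %q doesn't exist, starting from empty state", path)
		return nil, false, nil
	}
	return data, true, nil
}

// loadMissing runs loadSnapshot against a path that does not exist and
// reports whether anything was loaded and how many debug lines were emitted.
func loadMissing() (loaded bool, logLines int, err error) {
	dir, err := os.MkdirTemp("", "nflog-*")
	if err != nil {
		return false, 0, err
	}
	defer os.RemoveAll(dir)

	debugf := func(format string, args ...interface{}) {
		logLines++
		fmt.Println("debug:", fmt.Sprintf(format, args...))
	}
	_, loaded, err = loadSnapshot(filepath.Join(dir, "nflog"), debugf)
	return loaded, logLines, err
}

func main() {
	loaded, lines, err := loadMissing()
	fmt.Println("loaded:", loaded, "debug lines:", lines, "error:", err)
}
```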

alexweav · 2023-01-19T14:45:18Z

cmd/alertmanager/main.go

-		nflog.WithMaintenance(*maintenanceInterval, stopc, wg.Done, nil),
-		nflog.WithMetrics(prometheus.DefaultRegisterer),
-		nflog.WithLogger(log.With(logger, "component", "nflog")),
+	notificationLogOpts := nflog.Options{


This is a nice change 👍

alexweav · 2023-01-19T14:46:44Z

nflog/nflog.go

+	SnapshotReader io.Reader
+	SnapshotFile   string


Let's add a doc-comment to each, saying only one of these fields should be set. That way, callers can see this in their editors directly - just makes it a little more convenient to use.

The top-level comment reads as:

// A snapshot file or reader from which the initial state is loaded.
// None or only one of them must be set.

In my editor it shows as:

Which indicates:

// None or only one of them must be set.

Are you thinking of something different? Happy to do it, but unsure of what this means for a different editor 🤔

simonpasquier

lgtm

simonpasquier · 2023-01-19T13:18:24Z

silence/silence.go

 			return size, err
 		}
 		if size, err = s.Snapshot(f); err != nil {
+			f.Close()


simonpasquier · 2023-01-19T15:26:14Z

nflog/nflog.go

+// If not nil, the last argument is an override for what to do as part of the maintenance - for advanced usage.
+func (l *Log) Maintenance(interval time.Duration, snapf string, stopc <-chan struct{}, override MaintenanceFunc) {
+	if interval == 0 || stopc == nil {
+		level.Error(l.logger).Log("msg", "interval or stop signal are missing - not running maintenance")


Works for me. I'd suggest that we check the validity of the maintenance interval in cmd/alertmanager/main.go in a later PR. As of today, a negative maintenance interval triggers a panic...
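A sketch of what such a startup check might look like. The validateMaintenanceInterval function is hypothetical and does not exist in cmd/alertmanager/main.go. The thread does not say where the panic originates; one plausible source is time.NewTicker, which panics when given a non-positive duration, and the tickerPanics helper below demonstrates that behavior.

```go
package main

import (
	"fmt"
	"time"
)

// validateMaintenanceInterval is a hypothetical check that could run at
// startup, before any maintenance goroutine is launched.
func validateMaintenanceInterval(d time.Duration) error {
	if d <= 0 {
		return fmt.Errorf("maintenance interval must be positive, got %s", d)
	}
	return nil
}

// tickerPanics reports whether time.NewTicker panics for the duration d.
func tickerPanics(d time.Duration) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	time.NewTicker(d).Stop()
	return false
}

func main() {
	fmt.Println("ticker panics on -1s:", tickerPanics(-time.Second))
	fmt.Println("validate -1s:", validateMaintenanceInterval(-time.Second))
	fmt.Println("validate 15m:", validateMaintenanceInterval(15*time.Minute))
}
```

Rejecting the value up front turns a runtime panic into a configuration error reported at startup.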

gotjosh force-pushed the gotjosh/nflog-similar-to-silences branch from 675ee0c to d17244d Compare January 18, 2023 19:49

Use l.now() everywhere

5e9a323

Signed-off-by: gotjosh <josue.abreu@gmail.com>

gotjosh force-pushed the gotjosh/nflog-similar-to-silences branch from 13aaeeb to 5e9a323 Compare January 18, 2023 19:52

gotjosh added 3 commits January 18, 2023 19:54

lint imports

f770c7d

Signed-off-by: gotjosh <josue.abreu@gmail.com>

more linter fixing

bf303c1

Signed-off-by: gotjosh <josue.abreu@gmail.com>

yet more linting

95450ca

Signed-off-by: gotjosh <josue.abreu@gmail.com>

gotjosh requested a review from simonpasquier January 18, 2023 20:14

gotjosh commented Jan 18, 2023

gotjosh mentioned this pull request Jan 18, 2023

No calling GC on the notification log as part of the Grafana Alertmanager grafana/alerting#3

Closed

don't leak goroutines on main

1555995

Signed-off-by: gotjosh <josue.abreu@gmail.com>

alexweav reviewed Jan 19, 2023

simonpasquier approved these changes Jan 19, 2023

Add debug log when fail to load snapshot file

1a4f11f

Signed-off-by: gotjosh <josue.abreu@gmail.com>

simonpasquier reviewed Jan 19, 2023

gotjosh and others added 2 commits January 19, 2023 16:25

Update nflog/nflog.go

9e7fdf9

Co-authored-by: Simon Pasquier <pasquier.simon@gmail.com> Signed-off-by: gotjosh <josue.abreu@gmail.com>

add the smae message to silences

01daf19

Signed-off-by: gotjosh <josue.abreu@gmail.com>

simonpasquier approved these changes Jan 19, 2023

gotjosh merged commit f59460b into main Jan 19, 2023

gotjosh deleted the gotjosh/nflog-similar-to-silences branch January 19, 2023 16:39

krishnateja325 mentioned this pull request May 30, 2023

Update prometheus/alertmanager to version v0.25.1-0.20230505130626-263ca5c9438e cortexproject/cortex#5276

Merged

2 tasks
