Expose negative matching in silences #2471

Merged
beorn7 merged 6 commits into prometheus:master from vladimiroff:issue-1682
Feb 19, 2021

Conversation

@vladimiroff
Contributor

Closes #1682

@vladimiroff
Contributor Author

vladimiroff commented Jan 29, 2021

Odd failure in the API v1 acceptance tests; I cannot reproduce it even after make clean and rebuilding everything from scratch. They don't seem flaky, either:

--> GO111MODULE=on go test -race  -mod=vendor ./test/with_api_v1/acceptance -count 10
ok      github.com/prometheus/alertmanager/test/with_api_v1/acceptance  260.161s
-->

@QuentinBisson
Contributor

I tried to run it locally as well, and I see some timeouts when running it multiple times, but I am not sure where to investigate:

--- FAIL: TestReload (5.08s)
    acceptance.go:136: AM on 127.0.0.1:42783
    acceptance.go:311: Starting alertmanager failed: timeout
FAIL
FAIL    github.com/prometheus/alertmanager/test/with_api_v1/acceptance  260.103s
FAIL
> GO111MODULE=on go test -race  -mod=vendor ./test/with_api_v1/acceptance -count 10
ok      github.com/prometheus/alertmanager/test/with_api_v1/acceptance  260.126s
> GO111MODULE=on go test -race  -mod=vendor ./test/with_api_v1/acceptance -count 10
--- FAIL: TestAlwaysInhibiting (5.10s)
    acceptance.go:136: AM on 127.0.0.1:43685
    acceptance.go:311: Starting alertmanager failed: timeout
FAIL
FAIL    github.com/prometheus/alertmanager/test/with_api_v1/acceptance  260.111s
FAIL
> GO111MODULE=on go test -race  -mod=vendor ./test/with_api_v1/acceptance -count 10
--- FAIL: TestRepeat (5.08s)
    acceptance.go:136: AM on 127.0.0.1:35027
    acceptance.go:311: Starting alertmanager failed: timeout
FAIL
FAIL    github.com/prometheus/alertmanager/test/with_api_v1/acceptance  260.096s
FAIL

@beorn7 beorn7 self-assigned this Feb 2, 2021
@beorn7
Member

beorn7 commented Feb 2, 2021

Great. I'll have a look ASAP (which might not be that soon in absolute terms, sorry… 😥 ).

@asquare14 perhaps you might want to have a look here, too.

@QuentinBisson those timeouts are usually caused by something unrelated to the code. In the past we have seen the tests be sensitive to subtle things like the platform you are running on. Do you get the same problem when running against the code in the main branch?

@vladimiroff the CI test failure might also just be flakiness, but I doubt it; too many tests were failing. Do you have an idea what's going on? (I'll rerun the tests right now, and once I get to review this, I will hopefully be able to help in a more meaningful way.)

@beorn7
Member

beorn7 commented Feb 2, 2021

🎉 Rerunning "fixed" the tests. So I guess we should really work on making the tests less flaky (also see the recent #2461).

@QuentinBisson
Contributor

I am running Fedora, so this might not be happening on darwin only. I tried a few more times and it also happens on the master branch:

GO111MODULE=on go test -race  -mod=vendor ./test/with_api_v1/acceptance -count 10
ok      github.com/prometheus/alertmanager/test/with_api_v1/acceptance  260.115s
> GO111MODULE=on go test -race  -mod=vendor ./test/with_api_v1/acceptance -count 10
ok      github.com/prometheus/alertmanager/test/with_api_v1/acceptance  260.135s
> GO111MODULE=on go test -race  -mod=vendor ./test/with_api_v1/acceptance -count 10
--- FAIL: TestBatching (5.08s)
    acceptance.go:136: AM on 127.0.0.1:46571
    acceptance.go:311: Starting alertmanager failed: timeout
--- FAIL: TestSilencing (5.12s)
    acceptance.go:136: AM on 127.0.0.1:39889
    acceptance.go:311: Starting alertmanager failed: timeout

@beorn7
Member

beorn7 commented Feb 2, 2021

@QuentinBisson OK, so that's a problem with the tests in your environment in general. Would be great if you could find out what's happening so that we can improve the tests.

@beorn7
Member

beorn7 commented Feb 2, 2021

@roidelapluie raised a point, which I then explored: this PR allows the actual creation of silences with negative matchers. So what happens if you update only one Alertmanager instance to a new version with this feature but leave the other cluster members on the old version? The answer: they would read the silences with negative matchers as regular (i.e. non-regexp) positive matchers. The reason is that this code doesn't check whether m.Type has an unknown value, so the switch statement does nothing, which means mt.Regex stays at its default value, false. A fix would be to error out if m.Type has an unknown value.
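The failure mode and the proposed fix can be sketched as follows; the type and function names below are illustrative stand-ins, not the actual Alertmanager code:

```go
package main

import (
	"fmt"
)

// MatcherType mirrors the idea of the wire-level matcher type enum.
// An old binary only knows the first two values; the negative types
// are new in this PR.
type MatcherType int

const (
	MatchEqual MatcherType = iota
	MatchRegexp
	MatchNotEqual  // unknown to old cluster members
	MatchNotRegexp // unknown to old cluster members
)

type Matcher struct {
	Type  MatcherType
	Name  string
	Value string
}

type typeMatcher struct {
	Regex bool
}

// decodeMatcherBuggy sketches the bug: the switch has no default case,
// so an unknown (negative) type silently falls through and mt.Regex
// stays at its zero value, i.e. the matcher is read as a positive
// equality matcher, the opposite of what was intended.
func decodeMatcherBuggy(m Matcher) typeMatcher {
	var mt typeMatcher
	switch m.Type {
	case MatchRegexp:
		mt.Regex = true
	}
	return mt
}

// decodeMatcherFixed errors out on unknown values instead, as proposed.
func decodeMatcherFixed(m Matcher) (typeMatcher, error) {
	var mt typeMatcher
	switch m.Type {
	case MatchEqual:
	case MatchRegexp:
		mt.Regex = true
	default:
		return mt, fmt.Errorf("unknown matcher type %d", m.Type)
	}
	return mt, nil
}

func main() {
	neg := Matcher{Type: MatchNotEqual, Name: "foo", Value: "bar"}
	fmt.Printf("buggy: %+v\n", decodeMatcherBuggy(neg))
	_, err := decodeMatcherFixed(neg)
	fmt.Println("fixed:", err)
}
```

With the fix, an old instance rejects a silence it cannot represent instead of silently inverting its meaning.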

So what to do now? Ideas:

  1. Don't do anything; just warn users that they should not use silences with negative matchers before they have updated all cluster instances to the new Alertmanager. The bad part is that users who don't notice the warning will essentially create silences that do the exact opposite of what they intended.
  2. Cut a bugfix release for 0.21 that does what I described above. In that case, silences with negative matchers will simply error out on the other instances and therefore do nothing. That is still not good, but it is the more conservative approach.

Of course, even with the second approach, users who haven't upgraded to the bugfix release will still be in the first situation.

Thoughts?

@beorn7
Member

beorn7 commented Feb 2, 2021

Note that nothing bad will happen in mixed clusters as long as you don't create silences using negative matchers. In a way, we can hope that this new feature doesn't become popular too quickly to limit the damage. (o:

@vladimiroff
Contributor Author

vladimiroff commented Feb 4, 2021

The latest commit actually exposes the isEqual field to the UI. The matcher format <name><op><value> is now printed properly, and there's an 'Equal' checkbox in silence forms. The checkbox is checked by default in empty create forms and deduces the proper value from the operator in edit and recreate forms (just like the 'Regex' checkbox).
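The mapping from the two matcher flags to the rendered operator can be sketched like this; the flag names come from the PR description, while matcherOp itself is illustrative and not the actual UI code:

```go
package main

import "fmt"

// matcherOp maps the flags on an API silence matcher (isRegex, plus the
// isEqual field this PR adds) to the operator shown in the UI as
// <name><op><value>.
func matcherOp(isRegex, isEqual bool) string {
	switch {
	case isEqual && !isRegex:
		return "="
	case !isEqual && !isRegex:
		return "!="
	case isEqual && isRegex:
		return "=~"
	default: // !isEqual && isRegex
		return "!~"
	}
}

func main() {
	// The four combinations, e.g. a negative regexp matcher:
	fmt.Println("env" + matcherOp(true, false) + `"dev|staging"`)
}
```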

screenshot_20210204_235126

I would highly appreciate a thorough review, especially of that commit. While I feel fairly confident about the Go part, my experience with Elm boils down to trying to close this issue in a second PR, so far 😅


Also, I think I just found another flaky acceptance test (cc: @QuentinBisson):

    TestSilenceDelete - github.com/prometheus/alertmanager/test/with_api_v2/acceptance

    Failed
    === RUN   TestSilenceDelete
    === PAUSE TestSilenceDelete
    === CONT  TestSilenceDelete
    === CONT  TestSilenceDelete
        acceptance.go:186: starting alertmanager cluster: starting alertmanager failed: timeout
        acceptance.go:177: Error sending SIGTERM to Alertmanager process: no such process
    --- FAIL: TestSilenceDelete (5.11s)

@QuentinBisson
Contributor

Regarding test flakiness, I could reproduce it by running multiple instances in parallel, and the only issue I had was tests using a port already in use on my machine. The issue could be due to the following code in the acceptance test, and especially the "hope" part in the comment :D

// freeAddress returns a new listen address not currently in use.
func freeAddress() string {
	// Let the OS allocate a free address, close it and hope
	// it is still free when starting Alertmanager.
	l, err := net.Listen("tcp4", "localhost:0")
	if err != nil {
		panic(err)
	}
	defer func() {
		if err := l.Close(); err != nil {
			panic(err)
		}
	}()

	return l.Addr().String()
}

@beorn7 Regarding your backporting question, I guess it depends on the Alertmanager release process. Are there multiple releases maintained at the same time (i.e. if 2.22.0 is released, will there still be some work on the 2.21 branch)? If yes, I think we should backport; otherwise, if there is an upgrade to 2.22.0 and one to 2.21.x, I think most people will upgrade to 2.22.0 and skip the 2.21.x release, so a warning is the best we could do in my opinion.

@beorn7
Member

beorn7 commented Feb 5, 2021

Thanks for more hints about the test flakiness. By now we can be pretty confident that it has nothing to do with the changes here. So whoever has an idea how to improve it, let's handle it in a separate issue or PR, e.g. #2461 or #2453.

@beorn7
Member

beorn7 commented Feb 5, 2021

About the release strategy: We usually only do development on the last release branch for bugfixing. Once a new minor release has been cut, we rarely do bugfixes for the previous minor release.

In this case, the fix seems so easy that I'm tempted to just propose it, cut 0.21.1, and then warn users that, if they want to upgrade a whole cluster without risking inverting the meaning of their silences while the upgrade is happening, they should first upgrade to 0.21.1 before upgrading to 0.22.0.

@beorn7
Member

beorn7 commented Feb 5, 2021

@vladimiroff Thanks for all the work. I still need to find time for a code level review, but here is a thought about the UI:

My idea was to not add another checkbox to the silence form but to change it completely to the same kind of UI we have for alert filtering on the landing page. I.e. instead of this:

image

it would look like

image

With the basic idea that it should be easy to re-use the UI code we already have for the main alerts page. (I'm not a UI developer, either, but that's what I would naively assume. :-)

I think this would provide consistency, ease of use, and be much more compact.

@beorn7
Member

beorn7 commented Feb 5, 2021

For the 0.21 fix, I have now created #2479

Member

@beorn7 beorn7 left a comment


I think this looks good from the general code perspective. I just had an idea how to do the formatting more consistently, see comments.

And I really think we should at least try to use the UI I have proposed. @vladimiroff do you think you have a chance implementing it? Otherwise, I can try to get support from someone fluent in Elm.

Signed-off-by: Kiril Vladimirov <kiril@vladimiroff.org>
Signed-off-by: Kiril Vladimirov <kiril@vladimiroff.org>
Signed-off-by: Kiril Vladimirov <kiril@vladimiroff.org>
@vladimiroff
Contributor Author

I think this looks good from the general code perspective. I just had an idea how to do the formatting more consistently, see comments.

Thanks for the review. Addressed comments in 5f0cbe2.

And I really think we should at least try to use the UI I have proposed. @vladimiroff do you think you have a chance implementing it? Otherwise, I can try to get support from someone fluent in Elm.

Would definitely prefer somebody else to tackle properly the UI part. If there's nobody available I could have a go at it in a week or two. Does it make sense to merge this, being functionally complete, and have a separate issue for UI before releasing v0.22?

Member

@beorn7 beorn7 left a comment


One more comment to finish up the matchers formatting.

Comment on lines 133 to 137
output := []string{}
for _, matcher := range matchers {
output = append(output, extendedFormatMatcher(*matcher))
output = append(output, simpleFormatMatcher(*matcher))
}
return strings.Join(output, " ")
Member


I think we should also make use of how labels.Matchers.String formats multiple matchers, i.e. {foo="bar", dings="bums"} rather than foo="bar" dings="bums" as it is done now. And now that you have implemented labelsMatcher, I think it's very easy:

lms := labels.Matchers{}
for _, matcher := range matchers {
	lms = append(lms, labelsMatcher(*matcher))
}
return lms.String()
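For illustration, here is a self-contained sketch of that brace-and-comma style; miniMatcher and formatMatchers are stand-ins for the real labels.Matcher and Matchers.String, which live in Alertmanager's pkg/labels:

```go
package main

import (
	"fmt"
	"strings"
)

// miniMatcher carries only what is needed to show the formatting.
type miniMatcher struct {
	Name, Op, Value string
}

// formatMatchers mimics the {name="value", ...} output of
// labels.Matchers.String, as opposed to joining with plain spaces.
func formatMatchers(ms []miniMatcher) string {
	parts := make([]string, 0, len(ms))
	for _, m := range ms {
		// %q quotes and escapes the value, e.g. bar -> "bar".
		parts = append(parts, fmt.Sprintf("%s%s%q", m.Name, m.Op, m.Value))
	}
	return "{" + strings.Join(parts, ", ") + "}"
}

func main() {
	ms := []miniMatcher{{"foo", "=", "bar"}, {"dings", "!=", "bums"}}
	fmt.Println(formatMatchers(ms)) // {foo="bar", dings!="bums"}
}
```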

@beorn7
Member

beorn7 commented Feb 17, 2021

About the UI issue: I feel a bit uncomfortable changing the UI twice, even if we only cut a release after the final change. Ideally, we would get to the final (or "final-like") UI in this PR.

It's a long shot, but perhaps @w0rm or @stuartnelson3 or @mxinden might find a bit of time to help out @vladimiroff? I'm pretty sure it should be easy for someone fluent in Elm.

@w0rm
Member

w0rm commented Feb 17, 2021

@vladimiroff thanks for working on supporting negative matching in silences and even giving the Elm part a try!

I agree with @beorn7 that reusing the same UI would be ideal. We might be able to reuse a lot of existing code and simplify the UI while adding the new feature. I am happy to implement the proposed UI (ETA next Monday).

Will it be possible to only merge the changes to the Go codebase, and let me open a separate PR that updates the UI?

@beorn7
Member

beorn7 commented Feb 18, 2021

@w0rm thanks a million for your help!

@vladimiroff how about removing the UI changes from this PR then? Once my last comment is addressed, we can merge this, and @w0rm can do his magic.

There is non-trivial value escaping going on inside pkg/labels.

Signed-off-by: Kiril Vladimirov <kiril@vladimiroff.org>
Signed-off-by: Kiril Vladimirov <kiril@vladimiroff.org>
Signed-off-by: Kiril Vladimirov <kiril@vladimiroff.org>
@vladimiroff
Contributor Author

@vladimiroff how about removing the UI changes from this PR then? Once my last comment is addressed, we can merge this, and @w0rm can do his magic.

Sure, I think I've just addressed the comment in 84dd6ab. Removing all UI changes is as easy as a simple revert, but then the build fails. What I did instead was remove the changes in the silence form (i.e. that 'Regex' checkbox). Thus API compliance and proper rendering of non-equal matchers are still in place.

@beorn7 beorn7 merged commit d57b2dc into prometheus:master Feb 19, 2021
@beorn7
Member

beorn7 commented Feb 19, 2021

OK, thanks. Let's go with this.

@w0rm all is set for you to do your magic. It would be great if old deep-links to create silences continued to work. (People have a lot of those out there, e.g. to provide a silence link in a notification.)

yurkeen added a commit to yurkeen/alertmanager that referenced this pull request May 2, 2021
The changes introduced in prometheus#2471 (see prometheus@2b6315f#diff-4975284c2279cb0d4dbc76c701ea7c35725245a92caaf2e8be8e5b5e56cf1ab0) allow the generated `Matcher.IsEqual` to become nil (see https://github.com/prometheus/alertmanager/blob/2b6315f399b95c4e327dcd5f5eb10866ef844252/api/v2/models/matcher.go#L35), as its default value is now considered true. Checking for a nil value of this field and defaulting it to `true` in the `labelsMatcher()` func fixes the nil pointer dereference.

Signed-off-by: Yury Evtikhov <yury@cloudflare.com>

Development

Successfully merging this pull request may close these issues.

Allow negative (re) matching in silences

4 participants