Update Ruler to use upstream Prom Rule Manager #1571
jtlisi merged 11 commits into cortexproject:master from
Conversation
force-pushed cd4db11 to 6a74045
force-pushed 470a180 to ebf8caf
force-pushed bf31fbd to d65c96d
force-pushed 6b87cd6 to d3fec66
force-pushed 5fc8a74 to e7fd239
pkg/configs/client/client.go
Did we drop the `since` functionality for a reason? Is there any concern that, for extremely large rule sets, this will return very large amounts of data? Otherwise I like removing it for simplicity.
Should this method continue returning map[string]configs.VersionedRulesConfig and the translation to the new type occur in pkg/ruler/rules?
I would say `since` was for large numbers of tenants, regardless of the size of any one rule set, and I am a bit nervous about dropping it.
I do understand the concern. The main reason `since` was dropped is that the entire ruleset must be hashed when generating mapped files; otherwise changes to the ring will not be reflected in the scheduled evaluations. However, polling such a large payload is not ideal. One option would be for the config client to cache the previous response and use `since` internally. That way it can keep an up-to-date set of the active rules configs while only polling for changes.
Per discussions I'm re-adding since support now.
pkg/configs/client/client.go
Good catch, I'll remove this.
pkg/configs/client/client.go
Is this like a diff of rules you're parsing and rebuilding the final state?
Yes, that is essentially what is going on.
pkg/ruler/compat.go
Why are we locking here? None of the Appender implementations in prometheus lock suggesting these functions are not reentrant.
A lock is required here because we pool the samples for a user into the same appendable.
I think previously in Cortex we created a separate appendableAppender for each group, which was run on a single goroutine. So there must also be some change that means we have more goroutines talking to the same appendableAppender now?
Yea the rule groups for each user will all share an appendableAppender. There are some advantages to this approach. Primarily, it will make #1396 easier to solve since output limits for a user can be configured in the same place.
force-pushed ea8d2c4 to 7e394fa
@jtlisi I restored the configs Client functionality to what it was before and made a … This will be easy to extend in the future by adding other implementations such as …
pkg/ruler/rules/store.go
This implementation looks good to me. @bboreham are you ok with abstracting `since` into a concrete type that lives behind the interface? That way a full set of rules can be returned on each poll, but only new rules will be fetched from the config service. I don't think we can avoid handling the entire ruleset currently, since the horizontally sharded ruler needs to hash each rule group to ensure it is evaluating the appropriate set of rules.
The thing that I worry about is like this: say we have 10,000 rule groups across all tenants, and one tenant changes one of them, does the program do 10,000 things or 1 thing?
Feel free to correct any errors @jtlisi.
As it is currently designed, once per polling cycle each ruler will calculate a hash for every rule group to determine which groups it should process locally. This will happen 10,000 times per ruler in your scenario.
Next it will take this subset of the rule groups and compare them to locally stored files on disk, only updating the files that have changed. This will happen 10,000 / n times per ruler, where n is the number of rulers.
Then, if any files have changed or been added for a given user, it will clear the old Prometheus rules manager for that user and build a new one pointed at the new set of files.
Even if we do not perform this process once per polling cycle, we would need to at least do it when certain events happen (such as a ruler joining or leaving the ring). I believe @jtlisi preferred the straightforward nature of this approach.
force-pushed 004373f to 1c47e52
@jtlisi I rewrote the …
bboreham
left a comment
I haven't finished reading all the changes, but I have some notes.
Particularly when I tried it, it seemed to barf on the "v1" rules:
level=error ts=2019-11-01T17:29:22.664302075Z caller=ruler.go:331 msg="unable to poll for rules" err="yaml: unmarshal errors:\n line 53: cannot unmarshal !!str `ALERT D...` into rulefmt.RuleGroups"
CHANGELOG.md
That's going to cause disruption - can we map it to "deprecated" (ignored) first?
Any advice to the end-user what to do instead?
I updated it so the flags aren't removed and are instead deprecated with a message.
pkg/configs/client/client.go
Could this be done using CollectedRequest() ?
pkg/ruler/rules/rules.proto
could we have an introductory comment here saying what these protobuf definitions are for?
The use of protos made more sense before I split this out from a larger PR a few months back. The proto format is used to store rule groups in a denormalized way in an object store backend. It is also used to communicate between rulers to fulfill the /api/v1/rules endpoint, which reports the status of rules along with their health. Since each ruler only knows the state of the rules it is currently responsible for, it needs to communicate with every other ruler in the ring to get a complete view of rule health. To implement this, each ruler exposes a gRPC service.
force-pushed 37858a4 to 06e54b2
CHANGELOG.md
Changes should be on top.
force-pushed a842199 to 52aa0ea
@bboreham I fixed V1 rule loading and refactored based on your comments. This should be good for a second look.
force-pushed 2c9dc72 to 166c094
pkg/ruler/ruler.go
ruler.configs.url appears to be missing from this list of deprecated flags. It was part of the deleted pkg/configs/client/config.go file.
This flag also still exists. It got moved a bit and is registered in pkg/ruler/storage.go https://github.com/cortexproject/cortex/pull/1571/files#diff-16c509ab46b783eb193e10999f09ed31R21
force-pushed 1540040 to eb76b8b
bboreham
left a comment
Looks good enough to me.
Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
force-pushed eb76b8b to 4418ae9
This PR is a refactor of #1532 to utilize the Prometheus Rule Manager to schedule and evaluate rule groups.
Overview
Fixes #477
Fixes #493