feat: implement phantom_threading to group email alerts into threads#4623
feat: implement phantom_threading to group email alerts into threads#4623SuperQ merged 1 commit intoprometheus:mainfrom
Conversation
dd8d44b to
ab70ee3
Compare
sysadmind
left a comment
There was a problem hiding this comment.
I think we also need to add a test to make sure that functionality doesn't break in the future. I think we need to find a decision on if we want daily threading only, or if we want the user to choose from a set of threading options before we move forward.
There was a problem hiding this comment.
Hi, this is an interesting change! Threading for alertmanager notifications would be very useful for teams using email!
I have a couple questions:
- Did you consider using the
GroupKeyhas to source for the thread root id? If I understand the rest of this change correct, this would result in one thread per group. It's also consistent with how we typically group/dedup notifications. For example, the PagerDuty integration uses the hash of the group key as the PagerDuty dedup key: https://github.com/prometheus/alertmanager/blob/main/notify/pagerduty/pagerduty.go#L241. This would also make threading configured directly by routing config. - It looks like we end up just setting some headers on the email. Is this something that could be implemented by templated header? I'm also wondering if this has any strange interaction with the user's config if they set headers.
- The comment mentions email clients that use "
(commonly used) JWZ" - do you know how this change interacts with email clients that don't behave that way?
ab70ee3 to
418578d
Compare
|
Thank you both for your review! Answers inline:
Thanks, that’s a great tip! Done.
I prototyped this idea, and if we expose the shortened GroupKeyHash and n.hostname to the template, then users would be able to configure a templated header like so: …but that’s pretty complicated for a user. My preference would be to stick with the high-level configuration option (phantom threading enabled/disabled). If we really think it’s required, we can make the date configurable, but realistically, I don’t see one-thread-per-month as an option that anyone would want. Maybe one-thread-per-calendar-week? But that seems rather unconventional, too.
The worst that can happen is that email clients keep threading the way they currently do (i.e. no threading for alerts). Email clients must cope with non-existing references (common case: somebody adds you to an email thread, so you don’t have the earlier messages). |
Spaceman1701
left a comment
There was a problem hiding this comment.
Thanks for making those changes!
I prototyped this idea, and if we expose the shortened GroupKeyHash and n.hostname to the template, then users would be able to configure a templated header like so:
I see what you mean. I agree with your conclusion - having explicit config is more clear.
If we really think it’s required, we can make the date configurable, but realistically, I don’t see one-thread-per-month as an option that anyone would want.
I'm of two minds on the one-thread-per-day behavior. One one hand, I understand wanting to have threads end after a while (even now that there should be one thread per-group). On the other hand, it feels a little arbitrary to me.
To me, it seems like the ideal behavior would be to make one thread specific alert group (e.g. when a group resolves, the thread would be ended). However, we'd need to expose the notification reason AND aribtrary metadata (like suggested here) to make that work.
I really don't want to block this change on that feature, since it might be a while before it gets merged. I also could see users wanting daily/monthly/hourly threads anyway.
So I think I'd prefer if the date was configurable (and optional). Would you be open to that? For now, we can support just "daily" or no date-based cutoff, but I'd prefer if the config was left extensible so we can iterate on it over time. Something like:
phantom_threading:
enabled: true
thread_by_date: daily
418578d to
f938af6
Compare
stapelberg
left a comment
There was a problem hiding this comment.
All done! Please take another look.
f938af6 to
a96c1be
Compare
Spaceman1701
left a comment
There was a problem hiding this comment.
Thanks for making those changes! I have just a few comments that I'd like to see addressed.
a96c1be to
42aa449
Compare
stapelberg
left a comment
There was a problem hiding this comment.
Please take another look
Spaceman1701
left a comment
There was a problem hiding this comment.
Everything else looks great to me! Sorry for the last minute request 😞
42aa449 to
e4105e9
Compare
Spaceman1701
left a comment
There was a problem hiding this comment.
I caught a handful of things on re-read, sorry to have missed these before.
The only important comment here is the dependency on n.hostname - this will cause unexpected behavior for users who deploy alertmanager in HA mode.
de20260 to
3e389a8
Compare
Spaceman1701
left a comment
There was a problem hiding this comment.
Thanks for making those changes, this looks good to me!
|
Mabye we want some kind of testing? |
3e389a8 to
0438ce4
Compare
Some email clients such as Gmail apparently use their own heuristics for threading and already implement this behavior based on the subject. But for users of other email clients that only implement threading based on the relevant headers (e.g. notmuch), those users currently get one email thread for each newly firing alert. With threading enabled, all alert emails (of the same alert) on the same day are grouped into the same thread. Much nicer :) Signed-off-by: Michael Stapelberg <stapelberg@google.com>
0438ce4 to
8234e65
Compare
Added a test. |
|
❤️ this feature!! Thanks a lot @stapelberg ! |
| // Add threading headers. All notifications for the same alert group | ||
| // (identified by key hash) are threaded together. | ||
| threadBy := "" | ||
| if n.conf.Threading.ThreadByDate != "none" { |
There was a problem hiding this comment.
@stapelberg @SuperQ @Spaceman1701
I'm sorry to comment on a MR that's already been merged, but reading the added documentation here as well the implementation, it looks to me like it's quite easy to shoot yourself in the foot as a user.
Consider the following scenario:
- user sees in existing configuration
threading: enabled: true thread_by_date: daily
- user assumes this is configurable (because in my opinion this strongly suggests it is) and sets it to "hourly" or "monthly"
- alertmanager will silently accept the parameter but will still deliver them daily because every non-None value triggers the daily behaviour. See the line from the MR below.
if n.conf.Threading.ThreadByDate != "none" {...}
There was a problem hiding this comment.
Good catch, we should probably patch this so that the user sees a configuration error if it's set to anything other than none or daily.
Some email clients such as Gmail apparently use their own heuristics for threading and already implement this behavior based on the subject.
But for users of other email clients that only implement threading based on the relevant headers (e.g. notmuch), those users currently get one email thread for each newly firing alert.
With phantom_threading enabled, all alert emails (of the same alert) on the same day are grouped into the same thread. Much nicer :)
I have tested this manually and you can see the effect start to work in this screenshot:
(Monday morning, I got one thread per alert email notification; in the evening, the threading change was effective and emails are grouped into the daily thread.)