feat: implement phantom_threading to group email alerts into threads #4623

Merged
SuperQ merged 1 commit into prometheus:main from stapelberg:threading on Dec 2, 2025

@stapelberg
Contributor

Some email clients such as Gmail apparently use their own heuristics for threading and already implement this behavior based on the subject.

But users of other email clients that only implement threading based on the relevant headers (e.g. notmuch) currently get one email thread for each newly firing alert.

With phantom_threading enabled, all alert emails (of the same alert) on the same day are grouped into the same thread. Much nicer :)


I have tested this manually and you can see the effect start to work in this screenshot:

[Screenshot: 2025-10-22-alertmanager-threading]

(Monday morning, I got one thread per alert email notification; in the evening, the threading change was effective and emails are grouped into the daily thread.)

@stapelberg stapelberg force-pushed the threading branch 3 times, most recently from dd8d44b to ab70ee3 on October 22, 2025 06:55
Contributor

@sysadmind sysadmind left a comment

I think we also need to add a test to make sure that this functionality doesn't break in the future. Before we move forward, we also need to decide whether we want daily threading only, or whether the user should choose from a set of threading options.

Contributor

@Spaceman1701 Spaceman1701 left a comment

Hi, this is an interesting change! Threading for alertmanager notifications would be very useful for teams using email!

I have a couple questions:

  1. Did you consider using the GroupKey hash as the source for the thread root id? If I understand the rest of this change correctly, this would result in one thread per group. It's also consistent with how we typically group/dedup notifications. For example, the PagerDuty integration uses the hash of the group key as the PagerDuty dedup key: https://github.com/prometheus/alertmanager/blob/main/notify/pagerduty/pagerduty.go#L241. This would also make threading configurable directly through the routing config.
  2. It looks like we end up just setting some headers on the email. Is this something that could be implemented by a templated header? I'm also wondering if this has any strange interaction with the user's config if they set headers.
  3. The comment mentions email clients that use "(commonly used) JWZ" - do you know how this change interacts with email clients that don't behave that way?
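
For illustration, deriving a thread root ID from a group key could look like the sketch below. The helper name `threadRootID`, the choice of SHA-256, and the 16-character truncation are assumptions made for this example and are not taken from the PR.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// threadRootID derives a stable identifier from an alert group key.
// Hypothetical helper: the hash function and the 16-character
// truncation are assumptions for illustration only.
func threadRootID(groupKey string) string {
	sum := sha256.Sum256([]byte(groupKey))
	return hex.EncodeToString(sum[:])[:16]
}

func main() {
	a := threadRootID(`{}:{alertname="DiskFull"}`)
	b := threadRootID(`{}:{alertname="DiskFull"}`)
	c := threadRootID(`{}:{alertname="HighLatency"}`)

	// The same group key always maps to the same ID, so every
	// notification for that group can point at the same thread root.
	fmt.Println(a == b, a == c)
}
```

Because the ID depends only on the group key, the routing configuration that defines the groups would also define the threads.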

@stapelberg
Contributor Author

Thank you both for your review! Answers inline:

  1. Did you consider using the GroupKey hash as the source for the thread root id? If I understand the rest of this change correctly, this would result in one thread per group. It's also consistent with how we typically group/dedup notifications. For example, the PagerDuty integration uses the hash of the group key as the PagerDuty dedup key: https://github.com/prometheus/alertmanager/blob/main/notify/pagerduty/pagerduty.go#L241. This would also make threading configurable directly through the routing config.

Thanks, that’s a great tip! Done.

  2. It looks like we end up just setting some headers on the email. Is this something that could be implemented by a templated header? I'm also wondering if this has any strange interaction with the user's config if they set headers.

I prototyped this idea, and if we expose the shortened GroupKeyHash and n.hostname to the template, then users would be able to configure a templated header like so:

References: <alert-{{ .GroupKeyHash }}-{{ range $idx, $alert := .Alerts }}{{ if (eq $idx 0) }}{{ $alert.StartsAt.Format \"2006-01-02\" }}{{ end }}{{ end }}@{{ .Hostname }}>
In-Reply-To: <alert-{{ .GroupKeyHash }}-{{ range $idx, $alert := .Alerts }}{{ if (eq $idx 0) }}{{ $alert.StartsAt.Format \"2006-01-02\" }}{{ end }}{{ end }}@{{ .Hostname }}>

…but that’s pretty complicated for a user.

My preference would be to stick with the high-level configuration option (phantom threading enabled/disabled). If we really think it’s required, we can make the date configurable, but realistically, I don’t see one-thread-per-month as an option that anyone would want. Maybe one-thread-per-calendar-week? But that seems rather unconventional, too.
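
As a rough Go equivalent of the templated headers quoted above, the sketch below builds the phantom thread root and sets both headers to it. The function names, the example hash, and the example hostname are hypothetical; only the `alert-<hash>-<date>@<host>` layout comes from the template in this comment.

```go
package main

import (
	"fmt"
	"net/textproto"
	"time"
)

// exampleStart stands in for the StartsAt time of the first alert in the group.
var exampleStart = time.Date(2025, 10, 22, 6, 55, 0, 0, time.UTC)

// phantomMessageID builds the Message-ID of a thread root that is never
// actually sent. The layout follows the templated headers quoted above;
// the function itself is a hypothetical helper.
func phantomMessageID(groupKeyHash string, startsAt time.Time, hostname string) string {
	return fmt.Sprintf("<alert-%s-%s@%s>", groupKeyHash, startsAt.Format("2006-01-02"), hostname)
}

// threadingHeaders points both References and In-Reply-To at the phantom root.
func threadingHeaders(groupKeyHash string, startsAt time.Time, hostname string) textproto.MIMEHeader {
	id := phantomMessageID(groupKeyHash, startsAt, hostname)
	h := textproto.MIMEHeader{}
	h.Set("References", id)
	h.Set("In-Reply-To", id)
	return h
}

func main() {
	h := threadingHeaders("0123abcd", exampleStart, "alertmanager.example.org")
	fmt.Println("References:", h.Get("References"))
	fmt.Println("In-Reply-To:", h.Get("In-Reply-To"))
}
```

Every notification for the same group on the same day gets identical headers, which is what lets header-based clients collect them under one (absent) parent message.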

  3. The comment mentions email clients that use "(commonly used) JWZ" - do you know how this change interacts with email clients that don't behave that way?

The worst that can happen is that email clients keep threading the way they currently do (i.e. no threading for alerts). Email clients must cope with non-existing references (common case: somebody adds you to an email thread, so you don’t have the earlier messages).

Contributor

@Spaceman1701 Spaceman1701 left a comment

Thanks for making those changes!

I prototyped this idea, and if we expose the shortened GroupKeyHash and n.hostname to the template, then users would be able to configure a templated header like so:

I see what you mean. I agree with your conclusion - having explicit config is more clear.

If we really think it’s required, we can make the date configurable, but realistically, I don’t see one-thread-per-month as an option that anyone would want.

I'm of two minds on the one-thread-per-day behavior. On the one hand, I understand wanting to have threads end after a while (even now that there should be one thread per group). On the other hand, it feels a little arbitrary to me.

To me, it seems like the ideal behavior would be one thread per specific alert group (e.g. when a group resolves, the thread would end). However, we'd need to expose the notification reason AND arbitrary metadata (like suggested here) to make that work.

I really don't want to block this change on that feature, since it might be a while before it gets merged. I also could see users wanting daily/monthly/hourly threads anyway.

So I think I'd prefer if the date was configurable (and optional). Would you be open to that? For now, we can support just "daily" or no date-based cutoff, but I'd prefer if the config was left extensible so we can iterate on it over time. Something like:

phantom_threading:
    enabled: true
    thread_by_date: daily
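
One way such an extensible option could be consumed in Go is sketched below. The `ThreadByDate` type, its constants, and `threadSuffix` are hypothetical names for this example; only the `none` and `daily` values come from the discussion.

```go
package main

import (
	"fmt"
	"time"
)

// ThreadByDate is a hypothetical string enum for the thread_by_date option.
type ThreadByDate string

const (
	ThreadByDateNone  ThreadByDate = "none"
	ThreadByDateDaily ThreadByDate = "daily"
)

// exampleNow is a fixed timestamp so the output is reproducible.
var exampleNow = time.Date(2025, 11, 26, 17, 48, 0, 0, time.UTC)

// threadSuffix returns the date component that scopes a thread, or an
// error for values this sketch does not know about.
func threadSuffix(mode ThreadByDate, now time.Time) (string, error) {
	switch mode {
	case ThreadByDateNone:
		return "", nil // one thread per group, no date-based cutoff
	case ThreadByDateDaily:
		return now.Format("2006-01-02"), nil
	default:
		// Any future granularity would get its own case here.
		return "", fmt.Errorf("unsupported thread_by_date %q", mode)
	}
}

func main() {
	for _, m := range []ThreadByDate{ThreadByDateNone, ThreadByDateDaily} {
		s, err := threadSuffix(m, exampleNow)
		fmt.Printf("%s: suffix=%q err=%v\n", m, s, err)
	}
}
```

Keeping the option a string rather than a boolean leaves room to add values later without changing the shape of the config.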

Contributor Author

@stapelberg stapelberg left a comment

All done! Please take another look.

Contributor

@Spaceman1701 Spaceman1701 left a comment

Thanks for making those changes! I have just a few comments that I'd like to see addressed.

Contributor Author

@stapelberg stapelberg left a comment

Please take another look

Contributor

@Spaceman1701 Spaceman1701 left a comment

Everything else looks great to me! Sorry for the last minute request 😞

Contributor

@Spaceman1701 Spaceman1701 left a comment

I caught a handful of things on re-read, sorry to have missed these before.

The only important comment here is the dependency on n.hostname - this will cause unexpected behavior for users who deploy alertmanager in HA mode.
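
To illustrate the HA concern: if the phantom Message-ID embeds the local hostname, two replicas that notify for the same group compute different thread roots. The sketch below uses a hypothetical `phantomID` helper and made-up hostnames; the shared-domain variant is only one possible remedy and is not necessarily what the PR adopted.

```go
package main

import "fmt"

// phantomID formats a thread root ID. Hypothetical helper; the layout
// mirrors the "alert-<hash>-<date>@<host>" form discussed earlier.
func phantomID(groupKeyHash, date, domain string) string {
	return fmt.Sprintf("<alert-%s-%s@%s>", groupKeyHash, date, domain)
}

func main() {
	// Each replica uses its own hostname: the IDs differ, so emails
	// sent by different replicas would land in different threads.
	r0 := phantomID("0123abcd", "2025-11-26", "am-0.example.org")
	r1 := phantomID("0123abcd", "2025-11-26", "am-1.example.org")
	fmt.Println("per-host IDs match:", r0 == r1)

	// Both replicas use one fixed domain (an assumed remedy): the IDs
	// agree regardless of which replica sends the notification.
	s0 := phantomID("0123abcd", "2025-11-26", "alerts.example.org")
	s1 := phantomID("0123abcd", "2025-11-26", "alerts.example.org")
	fmt.Println("shared-domain IDs match:", s0 == s1)
}
```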

@stapelberg stapelberg force-pushed the threading branch 2 times, most recently from de20260 to 3e389a8 on November 26, 2025 17:48
Contributor

@Spaceman1701 Spaceman1701 left a comment

Thanks for making those changes, this looks good to me!

@TheMeier
Contributor

Maybe we want some kind of testing?

Some email clients such as Gmail apparently use their own heuristics
for threading and already implement this behavior based on the subject.

But for users of other email clients that only implement threading
based on the relevant headers (e.g. notmuch), those users currently
get one email thread for each newly firing alert.

With threading enabled, all alert emails (of the same alert)
on the same day are grouped into the same thread. Much nicer :)

Signed-off-by: Michael Stapelberg <stapelberg@google.com>
@stapelberg
Contributor Author

Maybe we want some kind of testing?

Added a test.

@SuperQ SuperQ merged commit 4e54940 into prometheus:main Dec 2, 2025
7 checks passed
@stapelberg stapelberg deleted the threading branch December 2, 2025 16:15
@SoloJacobs SoloJacobs mentioned this pull request Dec 7, 2025
@fauust

fauust commented Dec 18, 2025

❤️ this feature!! Thanks a lot @stapelberg !

// Add threading headers. All notifications for the same alert group
// (identified by key hash) are threaded together.
threadBy := ""
if n.conf.Threading.ThreadByDate != "none" {

@stapelberg @SuperQ @Spaceman1701
I'm sorry to comment on an MR that's already been merged, but reading the added documentation here as well as the implementation, it looks to me like it's quite easy to shoot yourself in the foot as a user.

Consider the following scenario:

  • user sees in existing configuration
    threading:
      enabled: true
      thread_by_date: daily
  • user assumes this is configurable (because in my opinion this strongly suggests it is) and sets it to "hourly" or "monthly"
  • alertmanager will silently accept the parameter but will still thread the emails daily, because every value other than "none" triggers the daily behaviour. See the line from the MR below.
    if n.conf.Threading.ThreadByDate != "none" {...}

Contributor

Good catch, we should probably patch this so that the user sees a configuration error if it's set to anything other than none or daily.
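
A minimal sketch of that validation, assuming a config struct with a string `ThreadByDate` field: the struct layout, YAML tags, and the `Validate` method are illustrative and not Alertmanager's actual code. How an unset value should be defaulted is left out of the sketch.

```go
package main

import "fmt"

// ThreadingConfig mirrors the threading block shown in the scenario above.
// The Go field names and tags are assumptions for illustration.
type ThreadingConfig struct {
	Enabled      bool   `yaml:"enabled"`
	ThreadByDate string `yaml:"thread_by_date"`
}

// Validate rejects any thread_by_date value other than "none" or "daily",
// so that a value like "hourly" fails at config load time instead of
// silently behaving like "daily".
func (c ThreadingConfig) Validate() error {
	switch c.ThreadByDate {
	case "none", "daily":
		return nil
	default:
		return fmt.Errorf("threading: unsupported thread_by_date %q, must be \"none\" or \"daily\"", c.ThreadByDate)
	}
}

func main() {
	for _, v := range []string{"daily", "none", "hourly"} {
		err := ThreadingConfig{Enabled: true, ThreadByDate: v}.Validate()
		fmt.Printf("%s: %v\n", v, err)
	}
}
```

With a check like this in place, the `!= "none"` comparison at send time would only ever see the two supported values.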

8 participants