fix(marker): stop state leakage from aggregation groups #4438

SuperQ merged 1 commit into prometheus:main
Conversation
SuperQ left a comment:
Looking at the test changes, it doesn't seem like this exercises the problem behavior. Would you mind including some additional testing to make sure we don't break in the future?
This change makes aggregation groups delete resolved alerts from the marker, avoiding the leakage of ghost states mentioned in prometheus#4402. Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Done, added 3 scenarios. The code is changed to make sure that we don't accidentally delete a marker due to a race condition. Please have a look and let me know if we should add more. While adding these I noticed that the whole Marker setup is a weak point in terms of consistency; we need to come up with a better solution at some point.
	ag.logger.Error("error on delete alerts", "err", err)
} else {
	// Delete markers for resolved alerts that are not in the store.
	for _, alert := range resolvedSlice {
Just a heads up: this is a race condition for the same reason we had to implement DeleteIfNotModified. There is a missing check to make sure the deleted alert is the same alert returned from ag.alerts.Get(alert.Fingerprint()).
Without this check, the alert can be deleted by DeleteIfNotModified, and then, before we call Get, a new alert is received. We then delete the marker of the new alert by mistake.
As we discussed on Slack, this is safe since even if we delete the marker, the alert status would then be unprocessed (see lines 261 to 274 in 52eb1fc).
Fixes: #4402