Treat error response payloads from Slack as errors #3121
gotjosh merged 1 commit into prometheus:main
Conversation
@simonpasquier are you the best person to review this?

Just rebased to make this merge-able. @gotjosh, @simonpasquier – any thoughts on this?

Is there anything I can do to get this reviewed, folks? cc @gotjosh, @simonpasquier
simonpasquier left a comment:

Is there any way that I can reproduce the issue locally?
(Force-pushed from bc12cb2 to d5dfcee)
Sure, I found this while using https://slack.com/api/chat.postMessage as the webhook URL.
gotjosh left a comment:

This is something that the Grafana equivalent does much better; can we instead borrow the idea from what we do there? https://github.com/grafana/alerting/blob/main/receivers/slack/slack.go#L191-L207
@gotjosh the grafana/alerting code is AGPL, right?

This situation happened a couple of months ago and I consulted internally about the steps necessary to allow that. The result is #2845 (comment) as an example. I'll use similar wording when I approve the PR.
(Force-pushed from 2a10762 to 9ddc399)
Sure! Given Alertmanager's use of `common`, I adapted the approach accordingly. Also expanded the tests to cover more cases.

I believe functionally this now should align with what Grafana are doing. PTAL?
(Force-pushed from a465be9 to 7317c32)
notify/slack/slack.go (Outdated)

```diff
 		return true, notify.RedactURL(err)
 	}
-	defer notify.Drain(resp)
+	defer resp.Body.Close()
```

I'm not 100% familiar with notify.Drain, but why would we stop using it here? The call is deferred, so we can still read resp.Body from the function call on Line 226.
It was no longer necessary to always drain and close, so I changed it to use standard Go libraries to make it easier to understand. I agree it does not matter, so I have restored notify.Drain if you prefer that.
```go
// (https://api.slack.com/messaging/webhooks#handling_errors)
func checkTextResponseError(body []byte) (bool, error) {
	if !bytes.Equal(body, []byte("ok")) {
		return false, fmt.Errorf("received an error response from Slack: %s", string(body))
```

This should be true, shouldn't it? Don't we want to retry failed webhooks?

```diff
-		return false, fmt.Errorf("received an error response from Slack: %s", string(body))
+		return true, fmt.Errorf("received an error response from Slack: %s", string(body))
```
Oh I see, in Grafana we don't retry these errors.
This is the case when we got a 200 back with an error message. I suspect in most cases it will not indicate an infrastructure error (or an otherwise transient issue) and retrying it will not yield a different result.
The function comment here links to Slack documentation that describes some of those errors, and none of them seem like something where retrying would help.
Yes I agree! Sorry for the original confusing comment!
notify/slack/slack.go (Outdated)

```go
	var data response
	if err := json.Unmarshal(body, &data); err != nil {
		return false, errors.Wrapf(err, "could not unmarshal JSON response %q", string(body))
```
Same question here; earlier we retried if the body could not be read:

```go
body, err := io.ReadAll(resp.Body)
if err != nil {
	return true, errors.Wrap(err, "could not read response body")
}
```

Shouldn't we retry if the JSON cannot be unmarshalled too?

```diff
-		return false, errors.Wrapf(err, "could not unmarshal JSON response %q", string(body))
+		return true, errors.Wrapf(err, "could not unmarshal JSON response %q", string(body))
```
Oh I see, in Grafana we don't retry these errors.
Yes, my original proposal (2bd3d4d) retried unmarshalling errors, but I have been asked here to match Grafana's behaviour.
I am happy either way, I just want alertmanager to not silently ignore errors while delivering notifications.
On the retry side, I think we should keep Alertmanager's behaviour intact (for now) - we can discuss as part of a separate PR if we should retry or not.
I took a quick look and we don't have a precedent in Alertmanager for these specific kinds of errors. I'm inclined to return true if parsing the response JSON fails, as that's what we also do if reading the response buffer fails.
Changed this to retry on JSON unmarshalling errors.
@gotjosh @simonpasquier something I missed is that these changes update Alertmanager to understand error responses from both Slack's incoming webhook API and the chat.postMessage API. Alertmanager supports both, but the fact that it does so seems coincidental rather than intentional. I think we should accept this change because I suspect Alertmanager users are using this integration with both APIs. However, I want to highlight that adding error handling for both makes that support more deliberate.
LGTM once #3121 (comment) has been fixed! Thanks for your contribution! ❤️ |
As described in the "More error types" section below, Slack API can return errors with a 200 response code:
https://slack.dev/node-slack-sdk/web-api#handle-errors

This change adds parsing of API response to extract error messages.

Signed-off-by: Anton Tolchanov <anton@tailscale.com>
@grobinson-grafana would you be OK with updating the documentation for that?

Sure! I just wanted to confirm that's what we want 🙂
I just tested a couple of different scenarios.

PR to update the docs: #3455
gotjosh left a comment:

LGTM

Thank you very much for your contribution.