Poison message handling for DT.AS V2 #1130
cgillum left a comment:
The biggest blocker for me on this PR is the set of changes to DurableTask.Core. I don't think a case has been made for them yet, and they need further discussion. At this point, my preference is to change only DurableTask.AzureStorage.
    /// Gets or sets the maximum dequeue count of any message before it is flagged as a "poison message".
    /// The default value is 20.
    /// </summary>
    public int PoisonMessageDeuqueCountThreshold { get; set; } = 20;
Suggested change:
    - public int PoisonMessageDeuqueCountThreshold { get; set; } = 20;
    + public int PoisonMessageDequeueCountThreshold { get; set; } = 20;
    {
        // We have limited information about the details of the message
        // since we failed to deserialize it.
        this.settings.Logger.MessageFailure(
Abandoning a message is not necessarily an error. There are other cases, such as out-of-order messaging race conditions, where we're able to recover gracefully by abandoning a message just once.
Consider putting this error message inside an if (exception != null) block.
There was a problem hiding this comment.
I believe this overload is only called in cases where there's an exception, as you can see from the comment on line 193. In that case, doesn't that mean the exception will never be null, @cgillum?
    using DurableTask.Core;
    using DurableTask.Core.History;

    // Create the poison message table if it doesn't exist
    string poisonMessageTableName = this.settings.TaskHubName.ToLowerInvariant() + "Poison";
    Table poisonMessagesTable = this.azureStorageClient.GetTableReference(poisonMessageTableName);
    await poisonMessagesTable.CreateIfNotExistsAsync();
We may want to consider caching to remember if we've already created this table. Otherwise, we may end up spamming the storage account with a lot of these calls.
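One way the caching idea could look is sketched below. This is only an illustration under stated assumptions: `OnceOnlyInitializer` and `EnsureCreatedAsync` are hypothetical names that do not exist in DurableTask.AzureStorage, and the storage call is passed in as a delegate so the sketch does not depend on the real `Table` type.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of DurableTask.AzureStorage): runs an async
// "create if not exists" call until it succeeds once, then skips it afterwards.
public sealed class OnceOnlyInitializer
{
    private readonly Func<Task> createIfNotExistsAsync;
    private readonly SemaphoreSlim gate = new SemaphoreSlim(1, 1);
    private volatile bool created;

    public OnceOnlyInitializer(Func<Task> createIfNotExistsAsync)
    {
        this.createIfNotExistsAsync = createIfNotExistsAsync
            ?? throw new ArgumentNullException(nameof(createIfNotExistsAsync));
    }

    public async Task EnsureCreatedAsync()
    {
        // Fast path: no storage call once creation has succeeded.
        if (this.created)
        {
            return;
        }

        await this.gate.WaitAsync();
        try
        {
            // Re-check under the lock so concurrent callers issue one call.
            if (!this.created)
            {
                await this.createIfNotExistsAsync();

                // Set only after success, so a failed attempt is retried later.
                this.created = true;
            }
        }
        finally
        {
            this.gate.Release();
        }
    }
}
```

In the snippet under review, this could be wired up as `new OnceOnlyInitializer(() => poisonMessagesTable.CreateIfNotExistsAsync())` and held in a field. The trade-off is that the cached flag goes stale if the table is deleted while the process is running.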
        }
    }

    public async Task<bool> TryHandlingDeserializationPoisonMessage(QueueMessage queueMessage, Exception deserializationException)
It looks like there's a lot of redundancy between these two methods. Is it possible to refactor them to share some logic?
    /// Gets or sets user-facing details for why a message was labeled as poison.
    /// This is to be set by each storage provider.
    /// </summary>
    public string PoisonGuidance { get; set; } = "";
I'm not comfortable with adding these properties to every DTFx history event, especially if we don't have a plan or design for whether/how to implement poison message handling for other backend providers. For the purposes of this PR, I'd prefer we make changes only to DurableTask.AzureStorage unless we can get broad agreement from across the team that this is the right approach.
There was a problem hiding this comment.
Makes sense. Just to add more context here (may be obvious) - the reason for the DTFx.Core change here is solely that it allows us to easily fail the function (activity or orchestrator) with the poison message. In the activity/orchestrator dispatchers, we are checking if isPoison is true and, if so, we're replacing the history event of the poison message with a failure event.
I assume that if, instead of doing this, we replace the message itself with a failure event at the DTFx.AS level, we should be able to get away without this "isPoison" property. I recall trying to do this and not finding it to be super obvious, but it's worth trying again. @cgillum if you have suggestions, I'm all ears!
Instead of having a "one-size fits no-one" implementation, could this be extended to include an error handler that developers can define themselves, potentially with this as the default implementation? That would allow developers to easily add new metrics and logs and to store messages where they see fit, instead of having to rely on table storage and on updates to the main library whenever more data needs to be readily available.
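To make the suggestion concrete, a pluggable handler might be shaped roughly like the sketch below. Every type here (`PoisonMessageContext`, `IPoisonMessageHandler`, `InMemoryPoisonMessageHandler`) is hypothetical and invented for illustration; none of them exist in DurableTask.AzureStorage, and the in-memory handler only stands in for the table-storage behavior this PR implements.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical: what a handler would be told about a poison message.
public sealed class PoisonMessageContext
{
    public PoisonMessageContext(string messageId, long dequeueCount, string rawBody, Exception failure)
    {
        this.MessageId = messageId;
        this.DequeueCount = dequeueCount;
        this.RawBody = rawBody;
        this.Failure = failure;
    }

    public string MessageId { get; }
    public long DequeueCount { get; }
    public string RawBody { get; }
    public Exception Failure { get; }
}

// Hypothetical extension point that developers could implement to add
// their own metrics, logs, or storage destination.
public interface IPoisonMessageHandler
{
    Task HandleAsync(PoisonMessageContext context);
}

// Stand-in "default" handler for this sketch. It keeps messages in memory;
// the PR's real default would write to the <TaskHubName>Poison table.
public sealed class InMemoryPoisonMessageHandler : IPoisonMessageHandler
{
    private readonly object sync = new object();
    private readonly List<PoisonMessageContext> messages = new List<PoisonMessageContext>();

    // Returns a snapshot so callers never observe the list mid-update.
    public IReadOnlyList<PoisonMessageContext> Messages
    {
        get
        {
            lock (this.sync)
            {
                return this.messages.ToArray();
            }
        }
    }

    public Task HandleAsync(PoisonMessageContext context)
    {
        if (context == null)
        {
            throw new ArgumentNullException(nameof(context));
        }

        lock (this.sync)
        {
            this.messages.Add(context);
        }

        return Task.CompletedTask;
    }
}
```

With a shape like this, the settings class could expose a handler property that defaults to the table-backed implementation, though where such a hook would live is an open design question.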
(I'm no longer directly with the team, but just my 2 cents) I think this PR is close to merge; it just needed one or two touch-ups that were discussed internally some time back. It might be most valuable to tackle those and provide some protection against poison messages, and then refine the solution over the long term.
This PR adds poison message handling for DurableTask.AzureStorage V2. There's a new `PoisonMessageDeuqueCountThreshold` setting in `AzureStorageOrchestrationServiceSettings`, with a default of 20. If a message has been dequeued more times than the value set for `PoisonMessageDeuqueCountThreshold` and still fails to deserialize, it gets added to a new `<TaskHubName>Poison` table in Azure Storage. Customers can go to this table in their storage account and look at these poison messages.
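The rule described above can be sketched as a small predicate. This is an illustration only: `PoisonMessagePolicy` and `IsPoison` are hypothetical names, and the strict "greater than" comparison is an assumption taken from the wording "more than" in the description; the PR's actual code may compare differently.

```csharp
using System;

// Hypothetical helper (not part of DurableTask.AzureStorage) expressing the
// rule from the PR description as a pure function.
public static class PoisonMessagePolicy
{
    // A message is treated as poison when it failed to deserialize AND its
    // dequeue count is above the configured threshold.
    public static bool IsPoison(long dequeueCount, int threshold, Exception deserializationException)
    {
        if (threshold <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(threshold), "Threshold must be positive.");
        }

        // Assumption: strictly greater than, per "more than" in the description.
        return deserializationException != null && dequeueCount > threshold;
    }
}
```

Keeping the decision in a pure function like this makes the boundary case (a dequeue count exactly equal to the threshold) easy to pin down in a unit test.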