Poison message handling for DT.AS V2 by bachuv · Pull Request #1130 · Azure/durabletask

bachuv · 2024-07-09T19:17:07Z

This PR adds poison message handling for DurableTask.AzureStorage V2. A new PoisonMessageDeuqueCountThreshold setting in AzureStorageOrchestrationServiceSettings defaults to 20. If a message is dequeued and fails to deserialize more times than PoisonMessageDeuqueCountThreshold allows, it is added to a new <TaskHubName>Poison table in Azure Storage. Customers can open this table in their storage account to inspect the poison messages.
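To make the threshold rule concrete, here is a minimal sketch of the decision described above. It is not the code in this PR: `PoisonSettings`, `PoisonCheck`, and `ShouldMoveToPoisonTable` are hypothetical names, and the strict "greater than" comparison is an assumption based on the wording "more than".

```csharp
using System;

// Hypothetical stand-in for the relevant part of the settings class.
public sealed class PoisonSettings
{
    // Mirrors the default of 20 described in the PR description.
    public int DequeueCountThreshold { get; set; } = 20;
}

public static class PoisonCheck
{
    // Returns true only when the latest attempt failed to deserialize AND the
    // message has already been dequeued more times than the threshold.
    // Assumption: the comparison is strict (>), not >=.
    public static bool ShouldMoveToPoisonTable(
        long dequeueCount,
        bool failedToDeserialize,
        PoisonSettings settings)
    {
        if (settings == null)
        {
            throw new ArgumentNullException(nameof(settings));
        }

        return failedToDeserialize && dequeueCount > settings.DequeueCountThreshold;
    }
}
```

With the default settings, a message that fails to deserialize on its 21st dequeue would be routed to the poison table, while one failing on its 20th would still be abandoned and retried.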

cgillum

Biggest blocker for me on this PR are the changes to DurableTask.Core. I don't think a case has been made for this yet, and it needs further discussion. At this point, my preference is to change only DurableTask.AzureStorage.

cgillum · 2024-07-09T19:54:13Z

src/DurableTask.AzureStorage/AzureStorageOrchestrationServiceSettings.cs

+        /// Gets or sets the maximum dequeue count of any message before it is flagged as a "poison message".
+        /// The default value is 20.
+        /// </summary>
+        public int PoisonMessageDeuqueCountThreshold { get; set; } = 20;


Suggested change

-        public int PoisonMessageDeuqueCountThreshold { get; set; } = 20;
+        public int PoisonMessageDequeueCountThreshold { get; set; } = 20;

cgillum · 2024-07-09T19:56:10Z

src/DurableTask.AzureStorage/Messaging/ControlQueue.cs

        {
+            // We have limited information about the details of the message
+            // since we failed to deserialize it.
+            this.settings.Logger.MessageFailure(


Abandoning a message is not necessarily an error. There are other cases, such as out-of-order messaging race conditions where we're able to recover gracefully by just abandoning a message once.

Consider putting this error message inside of an if (exception != null) block.

davidmrdavid · 2024-07-10

I believe this overload is only called in cases where there's an exception, as you can see by the comment in line 193:

durabletask/src/DurableTask.AzureStorage/Messaging/ControlQueue.cs, line 192 in 934a5b6:

// This overload is intended for cases where we aren't able to deserialize an instance of MessageData.

In that case, doesn't that mean the exception will never be null, @cgillum?

cgillum · 2024-07-09T19:57:32Z

src/DurableTask.AzureStorage/Messaging/TaskHubQueue.cs

    using DurableTask.Core;
    using DurableTask.Core.History;

+


nit: remove extra newline

cgillum · 2024-07-09T19:58:42Z

src/DurableTask.AzureStorage/Messaging/TaskHubQueue.cs

+                // Create the poison message table if it doesn't exist
+                string poisonMessageTableName = this.settings.TaskHubName.ToLowerInvariant() + "Poison";
+                Table poisonMessagesTable = this.azureStorageClient.GetTableReference(poisonMessageTableName);
+                await poisonMessagesTable.CreateIfNotExistsAsync();


We may want to consider caching to remember if we've already created this table. Otherwise, we may end up spamming the storage account with a lot of these calls.
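One way to sketch that caching, under stated assumptions: `OnceAsync` below is a hypothetical helper, not an existing DurableTask type. It remembers the first successful run of an async action and re-runs it only if the previous attempt faulted or was canceled.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: runs an async action at most once per successful attempt.
public sealed class OnceAsync
{
    private readonly Func<Task> action;
    private readonly object gate = new object();
    private Task cached;

    public OnceAsync(Func<Task> action)
    {
        this.action = action ?? throw new ArgumentNullException(nameof(action));
    }

    public Task EnsureAsync()
    {
        lock (this.gate)
        {
            // Start the action if it has never run, or retry if the last
            // attempt faulted or was canceled. Otherwise reuse the cached task.
            if (this.cached == null || this.cached.IsFaulted || this.cached.IsCanceled)
            {
                this.cached = this.action();
            }

            return this.cached;
        }
    }
}
```

In TaskHubQueue, a field such as `new OnceAsync(() => poisonMessagesTable.CreateIfNotExistsAsync())` could then be awaited via `EnsureAsync()` on each poison message, so only the first call (or a retry after failure) would reach the storage account. This keeps no state across process restarts, which seems acceptable since CreateIfNotExistsAsync is itself safe to repeat.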

cgillum · 2024-07-09T20:00:39Z

src/DurableTask.AzureStorage/Messaging/TaskHubQueue.cs

+            }
+        }
+
+        public async Task<bool> TryHandlingDeserializationPoisonMessage(QueueMessage queueMessage, Exception deserializationException)


It looks like there's a lot of redundancy between these two methods. Is it possible to refactor them to share some logic?

cgillum · 2024-07-09T20:03:23Z

src/DurableTask.Core/History/HistoryEvent.cs

+        /// Gets or sets user-facing details for why a message was labeled as poison.
+        /// This is to be set by each storage provider.
+        /// </summary>
+        public string PoisonGuidance { get; set; } = "";


I'm not comfortable with adding these properties to every DTFx history event, especially if we don't have a plan or design for whether/how to implement poison message handling for other backend providers. For the purposes of this PR, I'd prefer we make changes only to DurableTask.AzureStorage unless we can get broad agreement from across the team that this is the right approach.

davidmrdavid · 2024-07-10

Makes sense. Just to add more context here (may be obvious): the reason for the DTFx.Core change is solely that it lets us easily fail the function (activity or orchestrator) with the poison message. In the activity/orchestrator dispatchers, we check whether isPoison is true and, if so, replace the history event of the poison message with a failure event.

I assume that if instead of doing this, we replace the message itself with a failure event in the DTFx.AS level, we should be able to get away without this "isPoison" property. I recall trying to do this, and not finding it to be super obvious, but it's worth trying again. @cgillum if you have suggestions, I'm all ears!

Kencdk · 2025-01-23T08:18:39Z

Instead of having a "one-size-fits-no-one" implementation, could this be extended to include an error handler that developers can define themselves, potentially with this as the default implementation?

That would allow developers to easily add new metrics, logs and store messages where they see fit, instead of having to rely on table storage and updates to the main library in case we would like more data readily available.

davidmrdavid · 2025-01-23T16:18:46Z

(I'm no longer directly with the team, but just my 2 cents)
@Kencdk - in theory, yes, a user-defined handler could be good. But that adds a bit of complexity to the design and may open us up to needing to support this across all backends (usually DTFx code, like a handler, is portable across backends) and I'd worry about that further delaying this feature.

I think this PR is close to merge, just needed one or two touch ups that were discussed internally some time back. It might be most valuable to tackle those and provide some protection over poison messages, and then refine the solution over the long term.

bachuv added 3 commits June 25, 2024 12:05

initial commit · 9ab833a
updated package versions · e01d42e
reverted package version updates · 75dd96f

cgillum requested changes Jul 9, 2024

davidmrdavid mentioned this pull request Jul 22, 2024

Add simple poison message handling for Azure Storage #1063

Closed

davidmrdavid mentioned this pull request Oct 21, 2024

Poison messages handling #1040

Open

bachuv wants to merge 3 commits into azure-storage-v12 from poison-message-handling
