Skip to content

Add orchestrator scheduled actions check when suspended#1024

Merged
davidmrdavid merged 3 commits intomainfrom
nytian/suspend_bug
Jan 12, 2024
Merged

Add orchestrator scheduled actions check when suspended#1024
davidmrdavid merged 3 commits intomainfrom
nytian/suspend_bug

Conversation

@nytian
Copy link
Collaborator

@nytian nytian commented Jan 5, 2024

This PR fixes issue #2519

When calling SuspendAsync URL, condition can happen that the orchestrator is suspended when there is already an action scheduled and added to the orchestratorActionsMap. However, our design will add any later event into a temporary queue during suspended time. Thus, this scheduled action will never be removed or completed and will be executed endless times until orchestrator is resumed. And this execution history will also be added to the History Table which will cause the non-deterministic error when orchestrator is resumed.

The race condition occurs when an ExecutionSuspended message is processed after TaskCompleted. Additionally, there is a subsequent activity task pending scheduling after the completion of this task. During the processingTaskCompleted, the result is sent back to the extension level, and thus, durable extension will go to the next line and then schedule a new task, adds it to the action map which will cause the issue.

For debug, we can verify if a task has been executed multiple times by checking if columnTaskEventId in Kusto or TaskScheduledId in History Table. Each task should have a unique identifier in these fields.

The fix of this PR is to guarantee orchestratorActionsMap is empty when orchestrator is suspended. Basically, we will check the orchestratorActionsMap when suspending an orchestrator. If orchestratorActionsMap is not null, the action will be saved to a temporary dictionary and will be added back to the orchestratorActionsMap when the orchestrator is resumed.

Copy link
Collaborator

@davidmrdavid davidmrdavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits, but mostly looks good!

@davidmrdavid
Copy link
Collaborator

davidmrdavid commented Jan 8, 2024

I know there's a "stress tests" project here that I believe is meant to catch race conditions like the bug you're fixing. I wonder if you can re-use that infrastructure to set up a test for suspend/resume. Do you might giving that a look?

Copy link
Collaborator

@davidmrdavid davidmrdavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nytian: Do you have ideas for any telemetry that would have helped identify this bug faster? If so, I wonder if we could add that to this PR.

Copy link
Collaborator

@bachuv bachuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this issue!

@nytian nytian requested review from bachuv and davidmrdavid January 10, 2024 00:39
Copy link
Collaborator

@davidmrdavid davidmrdavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sync'ed w/ Naiyuan and, after the fact, we determined that we could have identified this bug by paying attention to the TaskScheduledId column of our telemetry. So I don't strictly think we need new logs anymore, just better team-wide awareness to pay attention to that source of information. Still, please get @bachuv's approval as well before merging.

Finally, @nytian, you wrote the following:

During the processingTaskCompleted, the result is sent back to the extension level, where SuspendAsync call is impossible to prevent the triggering of the next line. Thus, the extension will schedule a new activity and adds it to the action map which will cause the issue

Assuming I'm reading this correctly, I'm not certain that this statement is 100% accurate. In particular, DTFx.Core and the Extension run in the same thread so it's not quite right to say that "it's impossible [for DTFx] to prevent the triggering of the next line [of DF Extension user-code]." This is definitely possible so I think the explanation needs to be tweaked slightly :) . Let's discuss this offline later, it's just a nitpick

@nytian
Copy link
Collaborator Author

nytian commented Jan 10, 2024

I sync'ed w/ Naiyuan and, after the fact, we determined that we could have identified this bug by paying attention to the TaskScheduledId column of our telemetry. So I don't strictly think we need new logs anymore, just better team-wide awareness to pay attention to that source of information. Still, please get @bachuv's approval as well before merging.

Finally, @nytian, you wrote the following:

During the processingTaskCompleted, the result is sent back to the extension level, where SuspendAsync call is impossible to prevent the triggering of the next line. Thus, the extension will schedule a new activity and adds it to the action map which will cause the issue

Assuming I'm reading this correctly, I'm not certain that this statement is 100% accurate. In particular, DTFx.Core and the Extension run in the same thread so it's not quite right to say that "it's impossible [for DTFx] to prevent the triggering of the next line [of DF Extension user-code]." This is definitely possible so I think the explanation needs to be tweaked slightly :) . Let's discuss this offline later, it's just a nitpick

@davidmrdavid . Thanks for pointing out! I will update this part.

@davidmrdavid davidmrdavid merged commit 1585a61 into main Jan 12, 2024
@nytian nytian deleted the nytian/suspend_bug branch January 18, 2024 01:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants