Affected Version
All versions since 0.9.1.
Description
The seekableSupervisor does the following when a replica succeeds:
- Check the status of all other replicas from taskStorage.
- Stop all replicas if they are not finished yet.
- For tasks of unknown status, the supervisor kills them.
- If the stop request fails for some tasks, the supervisor kills them.
However, this algorithm has a race because task status is not updated in real time. Instead, the supervisor updates it per runNotice. As a result, the supervisor can successfully kill tasks that have already finished if their status has not been updated yet. Those tasks are then marked as failed even though the task logs show that they succeeded, which is very confusing.
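To make the race concrete, here is a minimal single-threaded sketch. It is not Druid's actual code: the class `StaleStatusRace`, its maps, and its method names are all hypothetical, and `refresh()` merely stands in for the periodic status update. It shows how a replica that has already succeeded can end up recorded as failed when the supervisor acts on a stale view.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the race; none of these names come from Druid.
class StaleStatusRace {
    enum Status { RUNNING, SUCCESS, FAILED }

    // What each task has actually reported (what its log would show).
    final Map<String, Status> actual = new HashMap<>();
    // The supervisor's view, which is only refreshed periodically.
    final Map<String, Status> cached = new HashMap<>();

    void startTask(String id) {
        actual.put(id, Status.RUNNING);
        cached.put(id, Status.RUNNING);
    }

    // The task finishes, but the supervisor's view is not refreshed yet.
    void taskSucceeds(String id) {
        actual.put(id, Status.SUCCESS);
    }

    // Periodic refresh, standing in for the per-runNotice update.
    void refresh() {
        cached.putAll(actual);
    }

    // One replica is known to have succeeded: kill every other replica
    // that still looks unfinished in the (possibly stale) cached view.
    void onReplicaSucceeded(String succeededId) {
        cached.put(succeededId, Status.SUCCESS);
        for (Map.Entry<String, Status> e : cached.entrySet()) {
            if (!e.getKey().equals(succeededId) && e.getValue() == Status.RUNNING) {
                // The kill is recorded as a failure, hiding the real outcome.
                e.setValue(Status.FAILED);
            }
        }
    }
}
```

If two replicas both call `taskSucceeds` before `refresh()` runs, `onReplicaSucceeded` for the first one records the second as `FAILED` while `actual` still says `SUCCESS`. That mismatch is the one described above.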
One workaround is to check task status more eagerly. However, this would only make the issue happen less often. I think we will eventually need the following changes:
- Update task status immediately when the status change is reported to the overlord.
- Add a new task status for canceled tasks.
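The two changes proposed above could look roughly like the following sketch. Again, this is only an illustration under assumptions: `EagerStatusTracker`, `onStatusChanged`, and the `CANCELED` value are hypothetical names, not existing Druid APIs, and the sketch ignores concurrency.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed behavior; not Druid's actual code.
class EagerStatusTracker {
    enum Status { RUNNING, SUCCESS, FAILED, CANCELED }

    final Map<String, Status> statuses = new HashMap<>();

    void startTask(String id) {
        statuses.put(id, Status.RUNNING);
    }

    // Applied as soon as a change is reported, not on the next periodic run.
    void onStatusChanged(String id, Status newStatus) {
        // Only a running task may change; a terminal status is never overwritten.
        if (statuses.get(id) == Status.RUNNING) {
            statuses.put(id, newStatus);
        }
    }

    // Replicas that are still running are canceled rather than marked failed.
    void onReplicaSucceeded(String succeededId) {
        for (Map.Entry<String, Status> e : statuses.entrySet()) {
            if (!e.getKey().equals(succeededId) && e.getValue() == Status.RUNNING) {
                e.setValue(Status.CANCELED);
            }
        }
    }
}
```

With this shape, a replica that reported success before the stop keeps `SUCCESS`, and one that was genuinely stopped early shows `CANCELED` instead of `FAILED`, so the recorded status and the task log would no longer disagree.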
I'm seeing this problem very frequently in our cluster, so I'm marking it as a release blocker for 0.15.0.