Skip to content

Conversation

@jrbourbeau
Copy link
Member

This PR updates the worker's heartbeat to the scheduler to include the duration of actively running tasks. This information can help us make more informed scheduling decisions when there's a discrepancy between the expected task duration (stored on the corresponding TaskPrefix.duration_average) and the actual duration of a running task. For example, this situation can arise in work stealing scenarios when a task takes much longer than expected based on the average task duration. Additionally, sending this information to the scheduler via worker heartbeat helps avoid significant message-per-task overhead.

cc @mrocklin @sheer-coiled for thoughts

active_durations={
key: now - self.tasks[key].start_time
for key in self.active_threads.values()
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should reclaim the term executing from above? That seems to be derivative state from this

@fjetter
Copy link
Member

fjetter commented Oct 28, 2020

Another situation where this might be useful is if the submitted tasks are very long running but the cluster hasn't seen them before. In the current situation this forces users to configure some estimation of a tasks runtime in the configuration to get reasonable responsive scaling behaviour using the Scheduler.adaptive_target. This would require an additional change to the WorkerState.processing dict, though.

@mrocklin
Copy link
Member

Another situation where this might be useful is if the submitted tasks are very long running but the cluster hasn't seen them before. In the current situation this forces users to configure some estimation of a tasks runtime in the configuration to get reasonable responsive scaling behaviour using the Scheduler.adaptive_target. This would require an additional change to the WorkerState.processing dict, though.

Agreed. I think/hope that this should remove most of the need for the config values in the future.

@jrbourbeau
Copy link
Member Author

That's a great point @fjetter, I totally agree with you. I suspect there are a few situations where we can use this duration information to make more informed scheduling decisions. This PR is just a first step to get active task durations to the scheduler

@jrbourbeau
Copy link
Member Author

Not sure why test_close_gracefully failed as the changes here seem unrelated. All CI builds passed after pushing an empty commit, so I'm inclined to mark it as a flaky test and merge this (I opened up #4201 to track the test failure)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants