Skip to content

Conversation

@jakirkham
Copy link
Member

@jakirkham jakirkham commented Feb 18, 2022

Refresh & follow-up to @martindurant's PR ( #5176 )

Continues the worker to pull out SchedulerState as a separate object used by (though no longer a class inherited by) Scheduler.

Note: This does not separate SchedulerState into a new module, but this sets things up to make that step easier.


  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

@jakirkham jakirkham mentioned this pull request Feb 18, 2022
3 tasks
@jakirkham jakirkham force-pushed the cy_scheduler_state_2 branch from e5ce10c to 3e0057e Compare February 18, 2022 19:13
@martindurant
Copy link
Member

@jakirkham , do you think this is the time that we should introduce a CI run which cythonizes? (cc @quasiben )

@jakirkham
Copy link
Member Author

jakirkham commented Feb 18, 2022

We already do 🙂

# Cythonize scheduler on Python 3.8 builds
if [[ "${{ matrix.python-version }}" = "3.8" ]]; then
python -m pip install -vv --no-deps --install-option="--with-cython" -e .
python -c "from distributed.scheduler import COMPILED; assert COMPILED"

Edit: Work was done in a couple PRs. Most notably PR ( #4764 ), which fixed a few lingering issues.

@martindurant
Copy link
Member

Well OK! It's not too clear from the run names.

@jakirkham jakirkham force-pushed the cy_scheduler_state_2 branch from 3e0057e to 8f53e13 Compare February 18, 2022 19:47
@jakirkham
Copy link
Member Author

Yeah that's fair. Maybe there is a way to get GH Actions to show this? I'm not really sure, but if anyone has suggestions on how to do this, can give it a try.

@jakirkham jakirkham force-pushed the cy_scheduler_state_2 branch 6 times, most recently from 9c928c9 to bfe66da Compare February 18, 2022 20:11
@github-actions
Copy link
Contributor

github-actions bot commented Feb 18, 2022

Unit Test Results

       12 files  ±0         12 suites  ±0   7h 16m 3s ⏱️ + 9m 31s
  2 620 tests ±0    2 539 ✔️  - 1    80 💤 ±0  1 +1 
15 644 runs  ±0  14 777 ✔️  - 4  866 💤 +3  1 +1 

For more details on these failures, see this check.

Results for commit 4f02566. ± Comparison against base commit fb8484e.

♻️ This comment has been updated with latest results.

@jakirkham jakirkham force-pushed the cy_scheduler_state_2 branch 16 times, most recently from 0830e10 to ffec8f4 Compare February 19, 2022 01:05
@jakirkham jakirkham force-pushed the cy_scheduler_state_2 branch from 0c0b47a to 79686a4 Compare March 1, 2022 00:26
@jakirkham jakirkham force-pushed the cy_scheduler_state_2 branch from 79686a4 to 62f57ba Compare March 1, 2022 00:29
@jakirkham
Copy link
Member Author

Fixed a few other small test failures that cropped up (changing from Scheduler to SchedulerState in a few cases). Seems to be working well.

Also resolved the MEMORY_* thread above.

Started working on splitting out SchedulerState into a separate module. This may take a little bit to do, but should be doable given the work already done here.

@martindurant
Copy link
Member

Shouldn't we push for merge now before splitting the files?

@jakirkham
Copy link
Member Author

jakirkham commented Mar 1, 2022

Sounds ok to me. Would help minimize the merge conflicts we run into by getting some of this in sooner 🙂

Edit: Marked as ready for review. Dropped WIP from the PR title. Also updated OP to note this doesn't move things into a separate module, but merely separates into different objects. Planning on continuing to work on that step though 😉

@jakirkham jakirkham marked this pull request as ready for review March 1, 2022 02:19
@jakirkham jakirkham changed the title [WIP] Extract SchedulerState from Scheduler Extract SchedulerState from Scheduler Mar 1, 2022
@jakirkham
Copy link
Member Author

Thoughts @dask/maintenance? 🙂

Copy link
Member

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit on the fence with this refactoring. On the one side, decoupling server code and state is great and moves us with the scheduler in a similar direction as we intend to go with the worker (see #5736).

On the other side, this refactoring draws the line a bit too arbitrary for my taste. I don't want us to split methods and state up by "cythonized" and "not-cythonized" but I feel this is what is happening here.

We are accessing the Scheduler.state 270 times! I didn't count how many of these actually mutate but many do. This violates principles like encapsulation or "Separation of Concern". These are obviously not principles we have to follow but I believe we should. I am afraid that this split, particularly when one side is cythonized, will make our lives much harder.

@@ -4015,31 +4164,136 @@ def __init__(
self.rpc.allow_offload = False
self.status = Status.undefined

####################
# state properties #
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer us to be much more conservative with API exposure. This PR already removes a few underscores and I would rather see the entire state to be closed off unless there is any concrete reason for it to be public.

Comment on lines +4172 to +4173
def aliases(self):
return self.state.aliases
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should aliases and host_info not rather be server attributes than state attributes?

Comment on lines +4183 to +4185
@property
def extensions(self):
return self.state.extensions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just like for plugins, I'm wondering how the extensions now are supposed to work (are extensions even something we do publicly? this is merely an internal thing to do, isn't it? Users would use plugins I assume)

Comment on lines +4818 to +4825
def get_task_duration(self, ts: TaskState) -> double:
return self.state.get_task_duration(ts)

def get_comm_cost(self, *args, **kwargs):
return self.state.get_comm_cost(*args, **kwargs)

def check_idle_saturated(self, *args, **kwargs):
return self.state.check_idle_saturated(*args, **kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These all are only accessed by stealing. Why not let stealing access scheduler.state instead?


def check_idle_saturated(self, *args, **kwargs):
return self.state.check_idle_saturated(*args, **kwargs)

def update_graph_hlg(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire update_graph_* functions appear to be much close to state than server. Why not move these to SchedulerState?

@jakirkham
Copy link
Member Author

Those concerns are reasonable Florian and I share them as well, but I fear we are putting the cart before the horse. The main issue is the Scheduler has been rather monolithic for a long time.

SchedulerState was an attempt to pull apart these pieces with a focus towards what could be optimized (as that is what drove its development). We are now pivoting and focusing on refactor, which I agree is a good thing (and was not really a priority before). So the starting point as a result is one where we already have the optimized pieces and we want to split them further.

The first step at least as we saw it was to break apart the inheritance structure, which I think you can now see was a fair bit of work on its own. The next step would be to get this into its own file so we are not trying to compile the whole Scheduler (kind of slow, touches a bunch of complex code that doesn't necessarily benefit from Cythonization, etc.). I've already started this, but would rather put it up after this PR lands. Then we can tug further at these various threads to get Scheduler and SchedulerState more self-contained.

However each of these passes is going to touch a fair bit of code, which likely are touched by other PRs. As a result to get this done, I think we are going to need to do a series of PRs that are more incremental to avoid too much churn at once. None of them will give us the warm fuzzy feeling that things are done, but each will move us incrementally in that direction.

To that end, it would help to have some feedback on this in light of this being a step on that path. How do we feel about the fact the two objects are split here? What things would we like to keep in mind for the next PRs? Etc.?

@fjetter
Copy link
Member

fjetter commented Mar 1, 2022

Thanks. I appreciate that this is not an all-in-one PR and we have to do this incrementally. I think we're not too far apart.

The first step at least as we saw it was to break apart the inheritance structure, which I think you can now see was a fair bit of work on its own.

Indeed, this was an important step and this alone is probably reason enough for this PR to be merged

The next step would be to get this into its own file so we are not trying to compile the whole Scheduler (kind of slow, touches a bunch of complex code that doesn't necessarily benefit from Cythonization, etc.). I've already started this, but would rather put it up after this PR lands. Then we can tug further at these various threads to get Scheduler and SchedulerState more self-contained.

I think this is the point where we disagree. I'm more concerned about the overall state of the code, its maintainability and architecture than the cythonization attempt. I would prefer us cleaning up the interfaces between the two classes and finish the refactor before we invest any more time in cythonization.

We agreed on "not dropping Cython" under the condition that we can isolate the code to a dedicated module such that "other work" may not be impacted as much by the cythonization as it is right now. With this PR this would amount to about half the code of scheduler.py. My concern is that splitting these classes properly up will likely give us a split of 80/20 or even worse such that we have only two options
A) keep improper interfaces since the one module is "mostly/pure cython" and the other is "python-only". We would still have a monolithic software design. Just that it is split on multiple modules and multiple languages.
B) move >80% of the scheduler code to cython

@martindurant
Copy link
Member

I feel like the alignment between us is actually pretty good. This PR is trying to do the minimum, so that we can answer exactly those interface questions; i.e., many decisions on whether a thing is in the state or not were simply taken from the previous superclass model. Now that we separate the state, we can make a more formal decision on where each thing should be. So I suggest we don't develop further in this PR, but add each of @fjetter 's points as to-do in a the next PR. The third PR would move the state into a different file to show our intent at encapsulation, and I would aim for the state file to contain 20% of the code, rather than the other way around.

@jakirkham
Copy link
Member Author

jakirkham commented Mar 1, 2022

Generally agree with Martin's points above. Also don't think we are doing any additional Cythonization here (and that is not my current priority either though likely would be an interest in the medium term). So also seeing pretty good alignment.

The one thing I've noticed on an ergonomic side of things is that fixing merge conflicts has been very slow (for example diff resolution software has been struggling with the size of the scheduler.py file). So splitting out to another file would really help this issue, which is why I was thinking about getting this done sooner rather than later.

@martindurant
Copy link
Member

Code count in this branch:

  • SchedulerState 2037 (code+space+doc+comments)
  • Scheduler 4419 (code+space+doc+comments)
  • scheduler.py
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                           1           1360           1312           5992

So we are talking ~30% of the code to be factored into a different file, depending how many functions and smaller classes except SchedulerState class go with

@fjetter
Copy link
Member

fjetter commented Mar 3, 2022

So we are talking ~30% of the code to be factored into a different file

I did a very rough split of what I thought should be done by the SchedulerState and came up with a much larger number

depending how many functions and smaller classes except SchedulerState class go with

That's mostly where my concern is. We're still unclear about what functions or smaller classes should be part of the state and what shouldn't.

I would love to talk about about what the final interface of a SchedulerState should look like. This would give me a better understanding of where we're going

@martindurant
Copy link
Member

I agree that if indeed it is ~most of the code, then the split is not worthwhile, and deciding where the demarkation line is, is an important necessary step before trying to implement the split. Still, does this affect the current PR?

@mrocklin
Copy link
Member

So, I'm not sure that I support this PR. Let me write down a couple of pros and cons:

  • Pros
    • If we do cythonization, then this helps us to isolate that code into a structure that is easier to understand than the current inheritance structure. It makes things more explicit
    • In the effort to figure out if we should do cythonization, it's helpful that it makes it clear which state belongs where
  • Cons
    • If we don't do Cythonization, then this needlessly changes how everyone who works on this codebase operates. Knowing that I can no longer touch self.workers but now have to touch self.state.workers and communicating this to everyone has some cost
    • If we don't do Cythonization, then there is a non-trivial chance that this breaks something that we're not expecting

Historically we've used main as a testing ground to figure out whether or not we should do Cythonization. This was because we were pretty sure that we wanted to do Cythonization, and because it was hard to manage the merge conflcts. I think at this point, after a year of doing this, I think that we can call this a mistake. We should not use main to experiment with Cythonization today. This PR feels like it's changing around main, which comes with the costs mentioned above, in order to answer the "should we do cythonization" question. I think that I'm mildly against merging it in. I think that we need to find out ways to answer that question without causing churn in main.

Thoughts?

@quasiben
Copy link
Member

Did @fjetter have time to go through this PR as well ?

@mrocklin
Copy link
Member

I believe that he's somewhat saturated currently, so I wouldn't be surprised if not.

@jakirkham
Copy link
Member Author

Ok well when we met last we discussed Florian would take a look at this PR. So I think we are a bit confused on what is going on here.

Maybe we need to have another meeting? Would suggest we do something next week (personally am pretty saturated this week).

@martindurant
Copy link
Member

What do you think, @jakirkham , do you still have any energy to give to this, aside from all the sunken costs? Personally, I am surprised that the scheduler as a bottleneck doesn't come up more often (yes, I know there have been improvements elsewhere); but when it IS - is there any other plausible acceleration on the horizon, or is cython the only game in town? pyston? To be sure, rewriting with something else is probably much more work than we have been considering here.

I'd be happy to discuss this question in another meeting, if we have any chance of coming to consensus.

@mrocklin
Copy link
Member

Ok well when we met last we discussed Florian would take a look at this PR. So I think we are a bit confused on what is going on here.

I'm sure that if Florian has time that he'll chime in. I think that currently he's really focused on removing deadlocks and other arguably more critical issues though.

Maybe we need to have another meeting? Would suggest we do something next week (personally am pretty saturated this week).

Sure, I'm happy to meet any time.

is there any other plausible acceleration on the horizon, or is cython the only game in town? pyston? To be sure, rewriting with something else is probably much more work than we have been considering here.

To be clear, I'm still enthusiastic about the Cythonization work, we just need to find a way to explore this without operating directly in main. We as a community probably shouldn't engage in experiments of this duration and this level of disruption in the production codebase I don't think. That doesn't mean that we can't make progress, it just means that we need to get more creative about how we do that work. I can't think of anything easier than mucking about directly in main, but there are other things to consider I think.

Do you all disagree? If so, I'd love to engage on that higher level question.

@fjetter
Copy link
Member

fjetter commented Mar 24, 2022

I had another look over this PR. I'm really torn but I'm still not convinced about this change and would prefer us to go in another direction by dropping Cython from main and reverting any complications introduced to make Cython work, i.e. revert the SchedulerState inheritance/split entirely.

I would really love to see us separating scheduling logic from server code. I believe such a modular software architecture has many advantages over the current, rather monolithic one.

This PR does not propose a modular architecture. I entirely understand that such an architecture can not be achieved with a single PR and this should be considered an increment. However, at the same time this PR draws a relatively arbitrary line to define SchedulerState. As already state, we are accessing the SchedulerState more than 200 times within Scheduler.

  • We're getting/setting 25 distinct attributes of this scheduler class and only use 14 methods.
  • There are many getters on scheduler level which just pass through to the state
    below.
  • We still have legacy attribute mappings defined on the scheduler that only
    pass down to the state
  • Many, if not all of the Scheduler.handle_* and Scheduler.stimulus_* methods do nothing else than mutate SchedulerState. Is this intentional? should they be moved to SchedulerState instead? If so, does this separation still make sense?
  • Thinking about concepts like "Single responsibility" I am struggling to come up with a responsibility where either class can work independently.
  • ...

Before this modular architecture is achieved, this intermediary state will cause pain since until then it is just another indirection in an already complex and complicated system. For me to support this PR, I would need confidence that the direction this is pursuing is correct. Right now, I'm lacking this confidence and am more concerned about the cost and possible churn this introduces.

I already mentioned that I consider this PR a slight improvement over main in a few areas. Particularly, because it removes an otherwise very confusing inheritance from the main SchedulerState. Especially in combination with Cython, which requires the use of parent all over the place, this makes things much more confusing than necessary. There are, of course, other areas were we have a slight improvement. However, this change also comes at a cost and I have the impression this cost is too high.
I keep coming back to the much simpler solution which is to drop Cython and revert this artificial split until we have a better understanding of how it is supposed to look like.

This was referenced Apr 7, 2022
@jakirkham jakirkham deleted the cy_scheduler_state_2 branch April 27, 2022 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants