Skip to content

Conversation

@mrocklin
Copy link
Member

cc @sjperkins this is what I was thinking. Thoughts?

@mrocklin
Copy link
Member Author

Hrm, another thought that came up is checking dask.config.get("annotations") when we call compute/persist/_graph_to_futures. This would support workflows like

with dask.annotate(retries=10):
    x.compute()

@jrbourbeau
Copy link
Member

another thought that came up is checking dask.config.get("annotations") when we call compute/persist/_graph_to_futures

Support for adding compute-time annotations seems sensible. Expanding your snippet above:

x = ... # Create Dask collection here w/o annotations
with dask.annotate(retries=10):
    # Add compute-time annotations
    x.compute()

We'll want to make sure that compute-time annotations take precedent over any overlapping annotations which may already exist on HLG layers.

@sjperkins
Copy link
Member

cc @sjperkins this is what I was thinking. Thoughts?

dask/dask#6889 and #4279 do currently handle transmission of arbitrary annotations to the scheduler (and SchedulerPlugins)

To me this looks like a special case handling of the existing distributed Client.{submit,compute,persist} taxonomy. This is handled here in #4279 and tested here.

I think the advantage of this PR's approach is that it saves the scheduler work unpacking the existing taxonomy out of the transmitted annotations. Also, one thing that didn't immediately occur to me is that this approach will provide backward compatibility for dictionary graphs.

I've merged this branch into dask/dask#6889 -- hope this OK. Let me add support for resources and resources and loose restrictions in that PR. I'll work on that now and open up both the dask and distributed PR's for review this evening (UTC + 02h00).

@mrocklin
Copy link
Member Author

mrocklin commented Dec 1, 2020

dask/dask#6889 and #4279 do currently handle transmission of arbitrary annotations to the scheduler (and SchedulerPlugins)

Ah right. My original approach to this was "let's just handle the annotations that we know about now, which we can do simply.
I don't see a need for full arbitrary annotations today"

However, you're right in that users may have needs here with their own code, as in with SchedulerPlugins.

I was hoping to find a simpler solution than your existing PRs, but let me take another look with this use case in mind. Thanks for pushing back here.

@sjperkins
Copy link
Member

sjperkins commented Dec 1, 2020

I was hoping to find a simpler solution than your existing PRs, but let me take another look with this use case in mind. Thanks for pushing back here.

Could you critique the current complexity in more detail? I think part of it may stem from the current Layer transmission mechanisms. My impression is that this is currently in flux.

@mrocklin
Copy link
Member Author

mrocklin commented Dec 2, 2020

Could you critique the current complexity in more detail? I think part of it may stem from the current Layer transmission mechanisms. My impression is that this is currently in flux.

The code is fine. I gave it a more thorough pass just now. I was a bit pedantic in some cases where minor code complexity arose just to make a point. I think that I'm probably more picky than most when it comes to reducing indirection in code.

My original point wasn't about the shape of your code in particular. It was more about the machinery involved in packing/unpacking things. I think that you're trying to do a more thorough job than I was trying to do here, and so your solution was understandably more complex.

I do think that what is in this PR is simpler, mostly because it only relies on systems that feel pretty solid today.

@dhirschfeld
Copy link
Contributor

My original approach to this was "let's just handle the annotations that we know about now, which we can do simply. I don't see a need for full arbitrary annotations today"

As someone who has been eagerly anticipating the use-cases a general annotation framework can enable I'd very much like a general rather than limited/specific solution to be implemented now.

In the last year it seems there's been a lot of focus from a number of core contributors on this difficult problem and I feel the project is pretty close to a workable general-purpose solution acceptable to everyone. I think if the project were to stop now with a limited solution that it may be significantly more difficult to generalise later on when everyone has moved on to other problems.

A general purpose solution will necessarily be more complex than a limited/specific solution but I think the use-cases it enables will justify the additional complexity and I think with the focus this feature is getting now that there will never be a better time to design/implement the more complex solution.

@sjperkins
Copy link
Member

I do think that what is in this PR is simpler, mostly because it only relies on systems that feel pretty solid today.

This is undoubtedly true, but I think it will suffer a similar problem to that experienced with Client.{submit, persist, compute} kwargs: expanding the taxonomy or collection scheduling functionality relies on adding more args/kwargs to update-graph ops with potential knock-on effects to every downstream project. I think the complexity surrounding those issues are far greater than the code complexity introduced by annotations.

It may be worth revisiting the points originally discussed here

  1. Modifying existing scheduler behaviour is risky for the existing dask user base.
  2. Encoding annotations in the graph provides hints to upstream schedulers that they may respect, but are under no obligation to do so.
  3. As it currently stands, users may
    • Create graphs via the collections (Array, Dataframe, Bag) and trust the scheduler to schedule work optimally.
    • Optimally schedule work imperatively via the distributed Client interface.
  4. Neither approach seems entirely satisfactory: Collections are great abstractions which are disappointing to discard for the optimal compute/data placement afforded by the Client interface.
  5. Annotations therefore bridge the gap between the two paradigms by providing an interface or glue layer.

Expert users can now write custom SchedulerPlugins which implement collection specific scheduling behaviour based on annotations without affecting downstream projects.

If the plugins are very effective, they can then be added to the scheduler by default, or their functionality integrated into the scheduler.

@mrocklin
Copy link
Member Author

mrocklin commented Dec 2, 2020

I'm totally fine going with the general approach. My comment about this being simpler because it relies on solid systems today was more of a statement of "This approach is only simpler because I'm solving a weaker problem" rather than "Let's go with this because it's simpler"

@mrocklin
Copy link
Member Author

mrocklin commented Dec 2, 2020

I have no specific desire to merge in this PR over yours. I do have some mild requests on yours though.

@sjperkins
Copy link
Member

I have no specific desire to merge in this PR over yours. I do have some mild requests on yours though.

I do appreciate the feedback: your comments have improved the PRs. I think I've addressed most of your existing requests, except for one point which I think requires input from others who've worked on the Layer hierarchy.

@mrocklin mrocklin closed this Dec 10, 2020
@mrocklin mrocklin deleted the annotations branch January 4, 2021 17:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants