Add Jupyter Server to Dask Scheduler #6737
Conversation
This adds an option to start a Jupyter Server in the Dask Scheduler process. This is admittedly a bit strange, but it is convenient when you have a Dask scheduler running and want a quick Jupyter notebook without setting up separate port forwarding, auth, etc. It piggybacks on the existing Tornado server application. This adds a new `Scheduler(jupyter=...)` keyword and a `dask-scheduler --jupyter` flag. It's a bit strange, but also pretty lightweight.
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0  15 suites ±0  6h 24m 46s ⏱️ +4m 36s

For more details on these failures and errors, see this check.

Results for commit d6040e3. ± Comparison against base commit add3663.

♻️ This comment has been updated with latest results.
gjoseph92
left a comment
I'd find this less concerning if it wasn't running in the same process (and same event loop). The issue with adding this is that users might actually... use it 😱. Making it super easy to run GIL-holding or event-loop-blocking code within the scheduler process would be a bad path to pave.
I see the value in making it convenient to spin up jupyter alongside the scheduler. But not within it. If we're going to provide this option to users, we should do it in the right way, not expose an API that encourages people to shoot themselves in the foot.
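To make the concern concrete, here is a minimal, self-contained sketch (plain asyncio, not the actual scheduler code) of how a single blocking call starves everything else sharing an event loop. The "heartbeat" and "user code" names are hypothetical stand-ins for scheduler housekeeping and synchronous user work:

```python
import asyncio
import time

async def heartbeat(ticks):
    # Stand-in for scheduler housekeeping that should run every ~10 ms.
    for _ in range(3):
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.01)

async def blocking_user_code():
    # Stand-in for GIL-holding / synchronous code run on the same loop.
    time.sleep(0.2)  # nothing else on the loop can run during this call

async def main():
    ticks = []
    await asyncio.gather(heartbeat(ticks), blocking_user_code())
    # Largest gap between heartbeats: ~0.2 s instead of the intended 0.01 s,
    # because the blocking call monopolized the loop.
    return max(b - a for a, b in zip(ticks, ticks[1:]))

worst_gap = asyncio.run(main())
print(f"worst heartbeat gap: {worst_gap:.3f}s")
```

In the real scheduler the equivalent stall would delay task scheduling and worker heartbeats, which is the "bad path" being described.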
@gjoseph92 I had the same concern until I realized that all the server does is start up kernels, which then do the actual work. We won't be running user code in the same process. I'll admit that I don't know how much actual work the server does, but I suspect that it is not that heavy. Does that address your concerns?
One potential issue is file I/O. Users might have large notebooks or other files that they are saving/loading using the Jupyter server. There has recently been some work toward an async contents manager, though I don't think it's on by default yet.
Let's say that a notebook is 10 MB; on a really slow drive (such as you might find on an AWS node) we're looking at roughly a 100 ms delay, right? That doesn't seem that bad. Also, the user is probably distracted at that time anyway and won't notice 🙂
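The back-of-the-envelope arithmetic behind that 100 ms figure, assuming a pessimistic ~100 MB/s of sustained throughput for a slow cloud disk (the throughput number is an assumption, not from the thread):

```python
notebook_mb = 10        # hypothetical large notebook
disk_mb_per_s = 100     # assumed slow-drive sustained throughput
delay_ms = notebook_mb / disk_mb_per_s * 1000
print(delay_ms)  # 100 ms of blocked event loop per save/load
```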
@mrocklin that does address my concerns, thanks. In that case this makes a lot of sense. Would that file I/O block the event loop?
My read of @ian-r-rose's comment is that yes, file I/O in the server (for example, reading a notebook) would block the event loop (I wouldn't necessarily have expected the Jupyter engineers to do async file I/O). File I/O within a kernel wouldn't (but I suspect that you already know that). I'm not terribly concerned about file I/O within the Jupyter server; I expect it to be brief and innocuous.
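For reference, the standard way to keep such reads off the loop is to hand them to a worker thread. A minimal sketch using `asyncio.to_thread` (Python 3.9+; the function name here is hypothetical, not Jupyter Server's actual contents-manager API):

```python
import asyncio
import os
import tempfile

async def read_file_off_loop(path):
    # open()/read() are synchronous; running them in a worker thread keeps
    # the event loop free to serve other handlers in the meantime.
    def _read():
        with open(path, "rb") as f:
            return f.read()
    return await asyncio.to_thread(_read)

async def main():
    # Write a tiny stand-in "notebook" to a temp file, then read it back.
    fd, path = tempfile.mkstemp(suffix=".ipynb")
    try:
        os.write(fd, b'{"cells": []}')
        os.close(fd)
        return await read_file_off_loop(path)
    finally:
        os.unlink(path)

contents = asyncio.run(main())
print(contents)
```

An async contents manager would amount to doing something like this (or true async I/O) inside the server's save/load paths.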
Aside from the graceful import issue, are there any further objections? @gjoseph92 I'm a bit busy with management stuff this week. If you're game, could I ask you to handle the graceful import issue and, if you're comfortable, merge? (I'm guessing that this wouldn't take you more than an hour and would be mildly satisfying.)
Yes, that is what I meant above: it would only be an issue when saving large files from the Jupyter(Lab) UI. This could happen if the user, either accidentally or purposefully, has enormous outputs like images, videos, or large logging output. But I'm not too worried about it for an experimental feature like this.
Co-authored-by: Matthias Bussonnier <bussonniermatthias@gmail.com>
    # SECURITY: in this context we expect this to be safe, as
    # if a client can connect to the scheduler they can already
    # run arbitrary code.
How does this statement hold up if TLS is enabled? We have some open work to add authentication to the HTTP server in #6431 because this isn't currently handled.
Running Jupyter like this could allow remote code execution without authentication, even if the user has configured scheduler security. I think we discussed this in person at SciPy, and my preference would still be to not allow this when TLS is enabled until we get an auth solution in, like #6431.
Agreed. This runs Jupyter on :8787 (or whatever the dashboard port is), right? So if you've opened dashboard access to the global internet (a common deployment pattern used in coiled, dask-cloudprovider, etc.; see #6407), then anyone on the internet can launch Jupyter kernels on your scheduler!
It's good that it's off by default, but I can't imagine many scenarios in which I'd want to turn it on given this authentication story. I'm not sure this feature is particularly useful until the authentication question is resolved.
I think that this should definitely be off by default for the reasons mentioned above. However, it's still valuable when authentication is handled by the system deploying Dask. I think that it makes sense to include this with a "consenting adults" mindset.
For example, in Coiled we've had secure dashboards in a PR for a while. This will probably motivate us to get that in. It would be useful in that context. I suspect that other enterprise-y systems also protect the dashboard using various auth systems external to Dask.
I think we should edit/remove this comment then, and document much more clearly, in big red letters on the --jupyter flag, that anyone who can access the dashboard can access Jupyter. It is therefore insecure and should only be used when access to the dashboard is restricted behind some other authentication proxy.
Open to any language. Could I ask you to make a suggestion or push a commit with language that you'd prefer?
I've added a warning to the --jupyter flag.
Can we remove this misleading comment too? We shouldn't communicate this is safe when it isn't.
@jacobtomlinson are you suggesting we remove this comment altogether, or do you have a suggested update to what this comment should say?
I would say something along the lines of
"We usually expect the dashboard to be a read-only view into the scheduler activity. However, by adding an open Jupyter application we are allowing arbitrary remote code execution on the scheduler via the dashboard server. This option should only be used when the dashboard is protected via other means, or when you don't care about cluster security."
Any further comments? If not, I plan to merge later this week.
Just mentioning that an alternative to jupyter-server is being developed in https://github.com/jupyter-server/jupyverse. It is based on FastAPI, and could be used in Dask distributed if it were to support ASGI.
Done in collaboration with @Carreau and @Zsailer