Conversation

@mrocklin (Member)

This adds an option to start a Jupyter Server in the Dask Scheduler process.

This is admittedly a bit strange, but it is also convenient when you have a
Dask scheduler running and want a quick Jupyter notebook without setting
up separate port forwarding, auth, etc.
This piggybacks on the existing Tornado server application.

This adds a new `Scheduler(jupyter=...)` keyword
and a `dask-scheduler --jupyter` flag.

It's a bit strange, but also pretty lightweight.

Done in collaboration with @Carreau and @Zsailer
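
For reference, a minimal usage sketch of the two entry points (the startup code below is an illustration, not this PR's implementation; ports and addresses are assumptions):

```python
# Command line: start a scheduler with the embedded Jupyter server
#   dask-scheduler --jupyter

# Python API: the new keyword on the Scheduler constructor
import asyncio
from distributed import Scheduler

async def main():
    # `jupyter=True` is the new keyword from this PR; the Jupyter app is
    # served from the scheduler's existing Tornado HTTP server.
    async with Scheduler(jupyter=True) as scheduler:
        await asyncio.sleep(3600)  # keep the scheduler (and Jupyter) running

asyncio.run(main())
```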

@github-actions bot (Contributor) commented Jul 16, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0 · 15 suites ±0 · 6h 24m 46s ⏱️ (+4m 36s)

|       | total        | passed      | skipped     | failed | errored |
|-------|--------------|-------------|-------------|--------|---------|
| tests | 2 978 (+1)   | 2 885 (−1)  | 88 (+1)     | 3 (−1) | 2 (+2)  |
| runs  | 22 088 (+16) | 21 035 (+3) | 1 046 (+10) | 5 (+1) | 2 (+2)  |

For more details on these failures and errors, see this check.

Results for commit d6040e3. ± Comparison against base commit add3663.

♻️ This comment has been updated with latest results.

@gjoseph92 (Collaborator) left a comment

I'd find this less concerning if it wasn't running in the same process (and same event loop). The issue with adding this is that users might actually... use it 😱. Making it super easy to run GIL-holding or event-loop-blocking code within the scheduler process would be a bad path to pave.

I see the value in making it convenient to spin up jupyter alongside the scheduler. But not within it. If we're going to provide this option to users, we should do it in the right way, not expose an API that encourages people to shoot themselves in the foot.

@mrocklin (Member Author)

@gjoseph92 I had the same concern until I realized that all the server does is start up kernels that then do the actual work. We won't be running user code in the same process. I'll admit that I don't know how much actual work the server does, but I suspect that it is not that heavy. Does that address your concerns?
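
One hedged way to sanity-check that claim from a notebook started this way (the scheduler address below is hypothetical):

```python
# If kernels really are separate processes, the kernel's PID will differ
# from the scheduler's PID even though Jupyter is served by the scheduler.
import os
from distributed import Client

client = Client("tcp://localhost:8786")             # hypothetical scheduler address
scheduler_pid = client.run_on_scheduler(os.getpid)
print("kernel PID:", os.getpid(), "| scheduler PID:", scheduler_pid)
```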

@ian-r-rose (Collaborator)

> I'll admit that I don't know how much actual work the server does, but I suspect that it is not that heavy.

One potential issue is file I/O. Users might have large notebooks or other files that they save and load through the Jupyter server. There has recently been some work toward an async contents manager, though I don't think it's on by default yet.

@mrocklin (Member Author)

Let's say that a notebook is 10 MB; on a really slow drive (such as exists on an AWS node) we're looking at a 100 ms delay, right? That doesn't seem that bad. Also, the user is probably distracted at that time anyway and won't notice 🙂
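
The arithmetic behind that estimate, assuming roughly 100 MB/s of sequential read throughput for a slow cloud drive (an assumption, not a measurement):

```python
notebook_size_mb = 10
slow_drive_mb_per_s = 100        # assumed throughput of a slow cloud drive
delay_ms = notebook_size_mb / slow_drive_mb_per_s * 1000
print(f"{delay_ms:.0f} ms")      # -> 100 ms
```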

@gjoseph92 (Collaborator)

@mrocklin that does address my concerns, thanks. In that case this makes a lot of sense.

Would that file IO block the event loop?

@mrocklin (Member Author)

> Would that file IO block the event loop?

My read of @ian-r-rose's comment is that yes, file I/O in the server (for example, reading a notebook) would block the event loop (I wouldn't necessarily have expected Jupyter engineers to do async file I/O). File I/O within a kernel wouldn't (but I suspect that you already know that).

I'm not terribly concerned about file I/O within the Jupyter server. I expect it to be brief and innocuous.
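
To make the concern concrete, here is a small standalone sketch (not code from this PR; the file size is arbitrary) showing that a synchronous read inside a coroutine stalls every other task on the same event loop:

```python
import asyncio
import os
import tempfile
import time

async def read_notebook(path):
    # Synchronous file read inside a coroutine: nothing else on this
    # event loop runs until it returns.
    with open(path, "rb") as f:
        return f.read()

async def heartbeat():
    while True:
        print(f"tick {time.monotonic():.2f}")
        await asyncio.sleep(0.05)        # these ticks pause during the read

async def main():
    # Stand-in for a large notebook: 50 MB of random bytes in a temp file.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(50 * 2**20))
        path = f.name
    try:
        hb = asyncio.create_task(heartbeat())
        await asyncio.sleep(0.2)         # let a few ticks through first
        await read_notebook(path)        # heartbeat stalls while this runs
        await asyncio.sleep(0.2)
        hb.cancel()
    finally:
        os.unlink(path)

asyncio.run(main())
```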

@mrocklin (Member Author)

Aside from the graceful import issue are there any further objections?

@gjoseph92 I'm a bit busy with management stuff this week. If you're game with this, could I ask you to handle the graceful import issue and, if you're comfortable, merge? (I'm guessing that this wouldn't take you more than an hour and would be mildly satisfying.)

@ian-r-rose (Collaborator) commented Jul 20, 2022

> My read of @ian-r-rose's comment is that yes, file I/O in the server (for example, reading a notebook) would block the event loop (I wouldn't necessarily have expected Jupyter engineers to do async file I/O). File I/O within a kernel wouldn't (but I suspect that you already know that).

Yes, that is what I meant above -- it would only be an issue when saving large files from the Jupyter(Lab) UI. This could happen if the user either accidentally or purposefully has enormous outputs like images, videos, or large logging output. But I'm not too worried about it for an experimental feature like this.

Comment on lines +3023 to +3025
# SECURITY: in this context we expect this to be safe, as
# if a client can connect to the scheduler they can already
# run arbitrary code.
Member

How does this statement hold up if TLS is enabled? We have some open work to add authentication to the HTTP server in #6431 because this isn't currently handled.

Running Jupyter like this could allow remote code execution without authentication, even if the user has configured scheduler security. I think we discussed this in person at SciPy; my preference would still be to not allow this when TLS is enabled until we get an auth solution in, such as #6431.

Collaborator

Agreed. This runs jupyter on :8787 (or whatever the dashboard port is), right? So if you've exposed the dashboard to the public internet (a common deployment pattern used in coiled, dask-cloudprovider, etc.; see #6407), then anyone on the internet can launch jupyter kernels on your scheduler!

It's good that it's off by default, but I can't imagine many scenarios in which I'd want to turn it on given the current authentication story. I'm not sure this feature is particularly useful until the authentication question is resolved?

Member Author

I think that this should definitely be off by default for the reasons mentioned above. However, it's still valuable when authentication is handled by the system deploying Dask. I think that it makes sense to include this with a "consenting adults" mindset.

For example, in Coiled we've had secure dashboards in a PR for a while. This will probably motivate us to get that in. It would be useful in that context. I suspect that other enterprise-y systems also protect the dashboard using various auth systems external to Dask.

Collaborator

I think we should edit or remove this comment then, and document much more clearly, in big red letters, on the `--jupyter` flag that anyone who can access the dashboard can access jupyter, so it is insecure and should only be used when access to the dashboard is restricted behind some other authentication proxy.

Member Author

Open to any language. Could I ask you to make a suggestion or push a commit with language that you'd prefer?

Member Author

I've added a warning to the `--jupyter` flag.
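
For illustration, a hedged sketch of how such a warning might be attached to the flag's help text, assuming the click-based dask-scheduler CLI (this is not the PR's actual diff):

```python
import click

@click.command()
@click.option(
    "--jupyter/--no-jupyter",
    default=False,
    help=(
        "Start a Jupyter Server in the same process as the scheduler. "
        "WARNING: anyone who can reach the dashboard port can then run "
        "arbitrary code on the scheduler; only enable this when the "
        "dashboard is protected by an authenticating proxy or is on a "
        "trusted network."
    ),
)
def main(jupyter):
    ...  # scheduler startup would go here

if __name__ == "__main__":
    main()
```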

Member

Can we remove this misleading comment too? We shouldn't communicate this is safe when it isn't.

Member

@jacobtomlinson are you suggesting we remove this comment altogether, or do you have a suggested update to what this comment should say?

Collaborator

I would say something along the lines of

"We usually expect the dashboard to be a read-only view into the scheduler activity. However, by adding an open Jupyter application we are allowing arbitrary remote code execution on the scheduler via the dashboard server. This option should only be used when the dashboard is protected via other means, or when you don't care about cluster security."

Member

@mrocklin this was not resolved before merge; I've opened #6818 to resolve it.

@mrocklin (Member Author)

Any further comments? If not I plan to merge later this week.

@mrocklin mrocklin merged commit 650e3f4 into dask:main Aug 1, 2022
@mrocklin mrocklin deleted the jupyter branch August 1, 2022 15:30
@davidbrochart

Just mentioning that an alternative to jupyter-server is being developed in https://github.com/jupyter-server/jupyverse. It is based on FastAPI, and could be used in Dask-distributed if it were to support ASGI.

gjoseph92 pushed a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022