Add Jupyter Server to Dask Scheduler #6737
Conversation
This adds an option to start a Jupyter Server in the Dask Scheduler process. This is admittedly a bit strange, but it is convenient when you have a Dask scheduler running and want a quick Jupyter notebook without setting up separate port forwarding, auth, etc. It piggybacks on the existing Tornado server application. This adds a new `Scheduler(jupyter=...)` keyword and a `dask-scheduler --jupyter` flag. It's a bit strange, but also pretty lightweight.
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0  15 suites ±0  6h 24m 46s ⏱️ +4m 36s

For more details on these failures and errors, see this check.

Results for commit d6040e3. ± Comparison against base commit add3663.

♻️ This comment has been updated with latest results.
gjoseph92
left a comment
I'd find this less concerning if it wasn't running in the same process (and same event loop). The issue with adding this is that users might actually... use it 😱. Making it super easy to run GIL-holding or event-loop-blocking code within the scheduler process would be a bad path to pave.
I see the value in making it convenient to spin up jupyter alongside the scheduler. But not within it. If we're going to provide this option to users, we should do it in the right way, not expose an API that encourages people to shoot themselves in the foot.
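To make the concern concrete, here is a minimal, self-contained sketch (plain asyncio, not the actual scheduler code) of how a single blocking call starves everything else sharing an event loop. The "heartbeat" and "user code" names are hypothetical stand-ins for scheduler housekeeping and synchronous user work:

```python
import asyncio
import time

async def heartbeat(ticks):
    # Stand-in for scheduler housekeeping that should run every ~10 ms.
    for _ in range(3):
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.01)

async def blocking_user_code():
    # Stand-in for GIL-holding / synchronous code run on the same loop.
    time.sleep(0.2)  # nothing else on the loop can run during this call

async def main():
    ticks = []
    await asyncio.gather(heartbeat(ticks), blocking_user_code())
    # Largest gap between heartbeats: ~0.2 s instead of the intended 0.01 s,
    # because the blocking call monopolized the loop.
    return max(b - a for a, b in zip(ticks, ticks[1:]))

worst_gap = asyncio.run(main())
print(f"worst heartbeat gap: {worst_gap:.3f}s")
```

In the real scheduler the equivalent stall would delay task scheduling and worker heartbeats, which is the "bad path" being described.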
@gjoseph92 I had the same concern until I realized that all the server does is start up kernels, which then do the actual work. We won't be running user code in the same process. I'll admit that I don't know how much actual work the server does, but I suspect that it is not that heavy. Does that address your concerns?
One potential issue is file I/O. Users might have large notebooks or other files that they are saving/loading using the Jupyter server. There has recently been some work toward an async contents manager, though I don't think it's on by default yet.
Let's say that a notebook is 10 MB; on a really slow drive (such as you might find on an AWS node) we're looking at roughly a 100 ms delay, right? That doesn't seem that bad. Also, the user is probably distracted at that time anyway and won't notice 🙂
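The back-of-the-envelope arithmetic behind that 100 ms figure, assuming a pessimistic ~100 MB/s of sustained throughput for a slow cloud disk (the throughput number is an assumption, not from the thread):

```python
notebook_mb = 10        # hypothetical large notebook
disk_mb_per_s = 100     # assumed slow-drive sustained throughput
delay_ms = notebook_mb / disk_mb_per_s * 1000
print(delay_ms)  # 100 ms of blocked event loop per save/load
```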
@mrocklin that does address my concerns, thanks. In that case this makes a lot of sense. Would that file I/O block the event loop?
My read of @ian-r-rose's comment is that yes, file I/O in the server (for example, reading a notebook) would block the event loop (I wouldn't necessarily have expected the Jupyter engineers to do async file I/O). File I/O within a kernel wouldn't (but I suspect that you already know that). I'm not terribly concerned about file I/O within the Jupyter server; I expect it to be brief and innocuous.
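For reference, the standard way to keep such reads off the loop is to hand them to a worker thread. A minimal sketch using `asyncio.to_thread` (Python 3.9+; the function name here is hypothetical, not Jupyter Server's actual contents-manager API):

```python
import asyncio
import os
import tempfile

async def read_file_off_loop(path):
    # open()/read() are synchronous; running them in a worker thread keeps
    # the event loop free to serve other handlers in the meantime.
    def _read():
        with open(path, "rb") as f:
            return f.read()
    return await asyncio.to_thread(_read)

async def main():
    # Write a tiny stand-in "notebook" to a temp file, then read it back.
    fd, path = tempfile.mkstemp(suffix=".ipynb")
    try:
        os.write(fd, b'{"cells": []}')
        os.close(fd)
        return await read_file_off_loop(path)
    finally:
        os.unlink(path)

contents = asyncio.run(main())
print(contents)
```

An async contents manager would amount to doing something like this (or true async I/O) inside the server's save/load paths.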
Aside from the graceful import issue, are there any further objections? @gjoseph92 I'm a bit busy with management stuff this week. If you're game, could I ask you to handle the graceful import issue and, if you're comfortable, merge? (I'm guessing that this wouldn't take you more than an hour and would be mildly satisfying.)
Yes, that is what I meant above: it would only be an issue when saving large files from the Jupyter(Lab) UI. This could happen if the user, either accidentally or purposefully, has enormous outputs like images, videos, or large logging output. But I'm not too worried about it for an experimental feature like this.
Co-authored-by: Matthias Bussonnier <bussonniermatthias@gmail.com>
    # SECURITY: in this context we expect this to be safe, as
    # if a client can connect to the scheduler they can already
    # run arbitrary code.
How does this statement hold up if TLS is enabled? We have some open work to add authentication to the HTTP server in #6431 because this isn't currently handled.
Running Jupyter like this could allow remote code execution without authentication, even if the user has configured scheduler security. I think we discussed this in person at SciPy, and my preference would still be to not allow this when TLS is enabled until we get an auth solution in, like #6431.
Agreed. This runs Jupyter on :8787 (or whatever the dashboard port is), right? So if you've opened dashboard access to the global internet (a common deployment pattern used in coiled, dask-cloudprovider, etc.; see #6407), then anyone on the internet can launch Jupyter kernels on your scheduler!
It's good that it's off by default, but I can't imagine many scenarios in which I'd want to turn it on given this authentication story. I'm not sure this feature is particularly useful until the authentication question is resolved.
I think that this should definitely be off by default for the reasons mentioned above. However, it's still valuable when authentication is handled by the system deploying Dask. I think that it makes sense to include this with a "consenting adults" mindset.
For example, in Coiled we've had secure dashboards in a PR for a while. This will probably motivate us to get that in. It would be useful in that context. I suspect that other enterprise-y systems also protect the dashboard using various auth systems external to Dask.
I think we should edit/remove this comment then, and document much more clearly, in big red letters on the --jupyter flag, that anyone who can access the dashboard can access Jupyter. It is therefore insecure and should only be used when access to the dashboard is restricted behind some other authentication proxy.
Open to any language. Could I ask you to make a suggestion or push a commit with language that you'd prefer?
I've added a warning to the --jupyter flag.
Can we remove this misleading comment too? We shouldn't communicate this is safe when it isn't.
@jacobtomlinson are you suggesting we remove this comment altogether, or do you have a suggested update to what this comment should say?
I would say something along the lines of
"We usually expect the dashboard to be a read-only view into the scheduler activity. However, by adding an open Jupyter application we are allowing arbitrary remote code execution on the scheduler via the dashboard server. This option should only be used when the dashboard is protected via other means, or when you don't care about cluster security."
Any further comments? If not, I plan to merge later this week.
Just mentioning that an alternative to jupyter-server is being developed in https://github.com/jupyter-server/jupyverse. It is based on FastAPI, and could be used in Dask distributed if it were to support ASGI.
Done in collaboration with @Carreau and @Zsailer