From 87f33fd1f0cd4bb965bf24b91a0c17c89e228b71 Mon Sep 17 00:00:00 2001 From: 317brian <53799971+317brian@users.noreply.github.com> Date: Wed, 9 Nov 2022 15:01:09 -0800 Subject: [PATCH 1/8] docs: add juptyer API tutorial for API and jupyter tutorial index (#3) (cherry picked from commit aeb8d9e3390fa26d9c533dce0862295b80c58583) --- .gitignore | 1 + docs/tutorials/tutorial-jupyter-index.md | 55 +++ .../quickstart/jupyter-notebooks/README.md | 46 ++ .../jupyter-notebooks/api-tutorial.ipynb | 442 ++++++++++++++++++ website/sidebars.json | 3 +- 5 files changed, 546 insertions(+), 1 deletion(-) create mode 100644 docs/tutorials/tutorial-jupyter-index.md create mode 100644 examples/quickstart/jupyter-notebooks/README.md create mode 100644 examples/quickstart/jupyter-notebooks/api-tutorial.ipynb diff --git a/.gitignore b/.gitignore index f906e2426793..febece8879f3 100644 --- a/.gitignore +++ b/.gitignore @@ -27,3 +27,4 @@ README integration-tests/gen-scripts/ /bin/ *.hprof +**/.ipynb_checkpoints/ \ No newline at end of file diff --git a/docs/tutorials/tutorial-jupyter-index.md b/docs/tutorials/tutorial-jupyter-index.md new file mode 100644 index 000000000000..3d7fae739001 --- /dev/null +++ b/docs/tutorials/tutorial-jupyter-index.md @@ -0,0 +1,55 @@ +--- +id: tutorial-jupyter-index +title: "Jupyter Notebook tutorials" +--- + + + + + +You can try out the Druid APIs using the Jupyter Notebook-based tutorials. These tutorials provide snippets of Python code that you can use to run calls against the Druid API. + +## Before you start + +Make sure you meet the following requirements before starting the Jupyter-based tutorials: + +- Python3 + +- The `requests` package for Python. For example, you can install it with the following command: + + ```bash + pip3 install requests + ```` + +- Jupyter Lab (recommended) or Jupyter Notebook running on a non-default port. By default, Druid and Jupyter both try to use port `8888,` so start Jupyter on a different port. 
For example, use the following command to start Jupyter Lab on port `3001`: + + ```bash + jupyter lab --port 3001 + ``` + +- An available Druid instance. You can use the `micro-quickstart` configuration described in [Quickstart (local)](./index.md). The tutorials assume that you are using the quickstart, so no authentication or authorization is expected unless explicitly mentioned. + +## Tutorials + +The notebooks are located in the [apache/druid repo](https://github.com/apache/druid/tree/master/examples/quickstart/jupyter-notebooks/). You can either clone the repo or download the notebooks you want individually. + +The links that follow are the raw GitHub URLs, so you can use them to download the notebook directly, such as with `wget`, or manually through your web browser. Note that if you save the file from your web browser, make sure to remove the `.txt` extension. + +- [Introduction to the Druid API](https://raw.githubusercontent.com/apache/druid/master/api-tutorial-jupyter-nb/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb) walks you through some of the basics related to the Druid API and several endpoints. \ No newline at end of file diff --git a/examples/quickstart/jupyter-notebooks/README.md b/examples/quickstart/jupyter-notebooks/README.md new file mode 100644 index 000000000000..7c6e33d3d821 --- /dev/null +++ b/examples/quickstart/jupyter-notebooks/README.md @@ -0,0 +1,46 @@ +# Jupyter notebook tutorials for Druid + + + +You can try out the Druid APIs using the Jupyter Notebook-based tutorials. These tutorials provide snippets of Python code that you can use to run calls against the Druid API. + +## Before you start + +Make sure you meet the following requirements before starting the Jupyter-based tutorials: + +- Python3 + +- The `requests` package for Python. For example, you can install it with the following command: + + ```bash + pip3 install requests + ```` + +- Jupyter Lab (recommended) or Jupyter Notebook running on a non-default port. 
By default, Druid and Jupyter both try to use port `8888,` so start Jupyter on a different port. For example, use the following command to start Jupyter Lab on port `3001`: + + ```bash + jupyter lab --port 3001 + ``` + +- An available Druid instance. You can use the `micro-quickstart` configuration described in [Quickstart (local)](../../../docs/tutorials/index.md). The tutorials assume that you are using the quickstart, so no authentication or authorization is expected unless explicitly mentioned. + +## Tutorials + +The notebooks are located in the [apache/druid repo](https://github.com/apache/druid/tree/master/examples/quickstart/jupyter-notebooks/). You can either clone the repo or download the notebooks you want individually. + +The links that follow are the raw GitHub URLs, so you can use them to download the notebook directly, such as with `wget`, or manually through your web browser. Note that if you save the file from your web browser, make sure to remove the `.txt` extension. + +- [Introduction to the Druid API](https://raw.githubusercontent.com/apache/druid/master/api-tutorial-jupyter-nb/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb) walks you through some of the basics related to the Druid API and several endpoints. + +## Contributing + +If you build a Jupyter tutorial, you need to do a few things to add it to the docs in addition to saving the notebook in this directory: + +- Clear the outputs from your notebook before you make the PR. You can use the following command: + + ```bash + jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace ./path/to/notebook/notebookName.ipynb + ``` + +- Update the list of [Tutorials](#tutorials) on this page and in the [ Jupyter tutorial index page](../../../docs/tutorials/tutorial-jupyter-index.md#tutorials) in the `docs/tutorials` directory. When updating `tutorial-jupyter-index.md`, make sure you provide the URL to the raw version of the file. 
Since you need to specify a branch, the URL returns a 404 error until your PR is merged and the file exists on master. + diff --git a/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb new file mode 100644 index 000000000000..2006dc20ede5 --- /dev/null +++ b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb @@ -0,0 +1,442 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "ad4e60b6", + "metadata": { + "tags": [] + }, + "source": [ + "# Tutorial: Learn the basics of the Druid API\n", + "\n", + "\n", + " \n", + "This tutorial introduces you to the basics of the Druid API and some of the endpoints you might use frequently, including the following tasks:\n", + "\n", + "- Checking if your cluster is up\n", + "- Ingesting data\n", + "- Querying data\n", + "- Deleting data\n", + "\n", + "In a Druid deployment, you have [Master, Query, and Data servers](https://druid.apache.org/docs/latest/design/processes.html#server-types) that all fulfill different purposes. The endpoint you use for a certain action is determined in part by which server governs that part of Druid and the processes that run on that server type. That's why the [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html#historical) is organized by server type and process.\n", + "\n", + "## Table of contents\n", + "\n", + "- [Requirements](#Requirements)\n", + "- [Get basic cluster information](#Get-basic-cluster-information)\n", + "- [Ingest data](#Ingest-data)\n", + "- [Query your data](#Query-your-data)\n", + "- [Manage your data](#Manage-your-data)\n", + "- [Next steps](#Next-steps)\n", + "\n", + "For the best experience, use Jupyter Lab so that you can always access the table of contents." + ] + }, + { + "cell_type": "markdown", + "id": "8d6bbbcb", + "metadata": { + "tags": [] + }, + "source": [ + "## Requirements\n", + "\n", + "You'll need to install the Requests library for Python before you start. 
For example:\n", + "\n", + "```bash\n", + "pip3 install requests\n", + "```\n", + "\n", + "Next, you'll need a Druid cluster. This tutorial uses the `micro-quickstart` config described in the [Druid quickstart](https://druid.apache.org/docs/latest/tutorials/index.html). Download Druid and start it if you haven't already. In the root of the Druid folder, run the following command to start Druid:\n", + "\n", + "```bash\n", + "./bin/start-micro-quickstart\n", + "```\n", + "\n", + "Finally, you'll need either Jupyter lab (recommended) or Jupyter notebook. Both the quickstart Druid cluster and Jupyter notebook are deployed at `localhost:8888` by default, so you'll \n", + "need to change the port for Jupyter. To do so, stop Jupyter and start it again with the `port` parameter included. For example, you can use the following command to start Jupyter on port `3001`:\n", + "\n", + "```bash\n", + "# If you're using Jupyter lab\n", + "jupyter lab --port 3001\n", + "# If you're using Jupyter notebook\n", + "jupyter notebook --port 3001 \n", + "```\n", + "\n", + "To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the Druid host the tutorial uses. The quickstart deployment configures Druid to listen on port `8888` by default, so you'll be making API calls against `http://localhost:8888`. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b7f08a52", + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "import json\n", + "\n", + "# druid_host is the hostname and port for your Druid deployment. 
\n", + "druid_host = \"http://localhost:8888\"\n", + "dataSourceName = \"wikipedia_api\"\n", + "print(druid_host)" + ] + }, + { + "cell_type": "markdown", + "id": "2093ecf0-fb4b-405b-a216-094583580e0a", + "metadata": {}, + "source": [ + "In the rest of this tutorial, the `endpoint`, `http_method`, and `payload` variables are updated in code cells to call a different Druid endpoint to accomplish a task." + ] + }, + { + "cell_type": "markdown", + "id": "29c24856", + "metadata": { + "tags": [] + }, + "source": [ + "## Get basic cluster information\n", + "\n", + "In this cell, you'll use the `GET /status` endpoint to return basic information about your cluster, such as the Druid version, loaded extensions, and resource consumption.\n", + "\n", + "The following cell sets `endpoint` to `/status` and updates the HTTP method to `GET`. When you run the cell, you should get a response that starts with the version number of your Druid deployment." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "baa140b8", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "endpoint = \"/status\"\n", + "print(druid_host+endpoint)\n", + "http_method = \"GET\"\n", + "\n", + "payload = {}\n", + "headers = {}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "print(json.dumps(json.loads(response.text), indent=4))" + ] + }, + { + "cell_type": "markdown", + "id": "cbeb5a63", + "metadata": { + "tags": [] + }, + "source": [ + "### Get cluster health\n", + "\n", + "The `/status/health` endpoint returns `true` if your cluster is up and running. It's useful if you want to do something like programmatically check if your cluster is available. When you run the following cell, you should get `true` if your Druid cluster has finished starting up and is running." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5e51170e", + "metadata": {}, + "outputs": [], + "source": [ + "# GET \n", + "endpoint = \"/status/health\"\n", + "print(druid_host+endpoint)\n", + "http_method = \"GET\"\n", + "\n", + "payload = {}\n", + "headers = {}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "\n", + "print(response.text)" + ] + }, + { + "cell_type": "markdown", + "id": "1de51db8-4c51-4b7e-bb3b-734ff15c8ab3", + "metadata": { + "tags": [] + }, + "source": [ + "## Ingest data\n", + "\n", + "Now that you've confirmed that your cluster is up and running, you can start ingesting data. There are different ways to ingest data based on what your needs are. For more information, see [Ingestion methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods).\n", + "\n", + "This tutorial uses the multi-stage query (MSQ) task engine and its `sql/task` endpoint to perform SQL-based ingestion. The `/sql/task` endpoint accepts [SQL requests in the JSON-over-HTTP format](https://druid.apache.org/docs/latest/querying/sql-api.html#request-body) using the `query`, `context`, and `parameters` fields.\n", + "\n", + "To learn more about SQL-based ingestion, see [SQL-based ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html). For information about the endpoint specifically, see [SQL-based ingestion and multi-stage query task API](https://druid.apache.org/docs/latest/multi-stage-query/api.html).\n", + "\n", + "\n", + "The next cell does the following:\n", + "\n", + "- Includes a payload that inserts data from an external source into a table named `wikipedia_api`. The payload is in JSON format and included in the code directly. You can also store it in a file and provide the file. 
\n", + "- Saves the response to a unique variable that you can reference later to identify this ingestion task\n", + "\n", + "The example uses INSERT, but you could also use REPLACE. \n", + "\n", + "For the MSQ task engine, ingesting data is done through a task, so the response includes a `taskId` and `state` for your ingestion. You can use this `taskId` to reference this task later on to get more information about it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "362b6a87", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = \"/druid/v2/sql/task\"\n", + "print(druid_host+endpoint)\n", + "http_method = \"POST\"\n", + "\n", + "\n", + "payload = json.dumps({\n", + "\"query\": \"INSERT INTO wikipedia_api SELECT TIME_PARSE(\\\"timestamp\\\") AS __time, * FROM TABLE( EXTERN( '{\\\"type\\\": \\\"http\\\", \\\"uris\\\": [\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\": \\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"isUnpatrolled\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"user\\\", \\\"type\\\": \\\"string\\\"}]' ) ) PARTITIONED BY DAY\",\n", + " \"context\": {\n", + " \"maxNumTasks\": 3\n", + " }\n", + "})\n", + "\n", + "headers = {'Content-Type': 'application/json'}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "ingestion_taskId_response = response\n", + "print(response.text + f\"\\nInserting data into the table named {dataSourceName}.\")" + ] + }, + { + "cell_type": "markdown", + "id": "c1235e99-be72-40b0-b7f9-9e860e4932d7", + "metadata": { + "tags": [] + }, + "source": [ + "Extract the `taskId` value from the `ingestion_taskId_response` variable so that you can reference it later:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f578b9b2", + "metadata": {}, + "outputs": [], + "source": [ + "ingestion_taskId = json.loads(ingestion_taskId_response.text)['taskId']\n", + "print(ingestion_taskId)" + ] + }, + { + "cell_type": "markdown", + "id": "f17892d9-a8c1-43d6-890c-7d68cd792c72", + "metadata": { + "tags": [] + }, + "source": [ + "### Get the status of your task\n", + "\n", + "The following cell shows you how to get the status of your ingestion. You can see basic information about your query, such as when it started and whether or not it's finished.\n", + "\n", + "In addition to the status, you can retrieve a full report using `GET /druid/indexer/v1/task/TASK_ID/reports`, but you won't need that information for this tutorial." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fdbab6ae", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n", + "print(druid_host+endpoint)\n", + "http_method = \"GET\"\n", + "\n", + "payload = {}\n", + "headers = {}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "print(json.dumps(json.loads(response.text), indent=4))" + ] + }, + { + "cell_type": "markdown", + "id": "3b55af57-9c79-4e45-a22c-438c1b94112e", + "metadata": { + "tags": [] + }, + "source": [ + "## Query your data\n", + "\n", + "When you ingest data into Druid, Druid stores the data in a datasource, and this datasource is what you run queries against.\n", + "\n", + "### List your datasources\n", + "\n", + "You can get a list of datasources from the `/druid/coordinator/v1/datasources` endpoint. Since you're just getting started, there should only be a single datasource, the `wikipedia_api` table you created earlier when you ingested external data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "959e3c9b", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = \"/druid/coordinator/v1/datasources\"\n", + "print(druid_host+endpoint)\n", + "http_method = \"GET\"\n", + "\n", + "payload = {}\n", + "headers = {}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "print(response.text)" + ] + }, + { + "cell_type": "markdown", + "id": "622f2158-75c9-4b12-bd8a-c92d30994c1f", + "metadata": { + "tags": [] + }, + "source": [ + "### Query your data\n", + "\n", + "Now, you can query the data. Because this tutorial is running in Jupyter, make sure to limit the size of your query results using `LIMIT`. 
For example, the following cell selects all columns but limits the results to 3 rows for display purposes.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "694900d0-891f-41bd-9b45-5ae957385244", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = \"/druid/v2/sql\"\n", + "print(druid_host+endpoint)\n", + "http_method = \"POST\"\n", + "\n", + "payload = json.dumps({\n", + " \"query\": \"SELECT * FROM wikipedia_api LIMIT 3\"\n", + "})\n", + "headers = {'Content-Type': 'application/json'}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "\n", + "print(json.dumps(json.loads(response.text), indent=4))\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "950b2cc4-9935-497d-a3f5-e89afcc85965", + "metadata": { + "tags": [] + }, + "source": [ + "In addition to the query, you can define a few other fields within the payload. For a full list, see [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html).\n", + "\n", + "This tutorial uses a context parameter and a dynamic parameter.\n", + "\n", + "Context parameters can control certain characteristics related to a query, such as configuring a custom timeout. For more information, see [Context parameters](https://druid.apache.org/docs/latest/querying/query-context.html). In the example query that follows, the context block assigns a custom `sqlQueryId` to the query. Typically, the `sqlQueryId` is autogenerated. With a custom ID, you can reference the query more easily, such as when you need to cancel it.\n", + "\n", + "\n", + "Druid supports dynamic parameters, so you can either define certain parameters within the query explicitly or insert a `?` as a placeholder and define it in a parameters block. In the following cell, the `?` gets bound to the timestamp value of `2016-06-27` at execution time. 
For more information, see [Dynamic parameters](https://druid.apache.org/docs/latest/querying/sql.html#dynamic-parameters).\n", + "\n", + "\n", + "The following cell selects rows where the `__time` column contains a value greater than the value defined dynamically in `parameters` and sets a custom `sqlQueryId`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c3860d64-fba6-43bc-80e2-404f5b3b9baa", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = \"/druid/v2/sql\"\n", + "print(druid_host+endpoint)\n", + "http_method = \"POST\"\n", + "\n", + "payload = json.dumps({\n", + " \"query\": \"SELECT * FROM wikipedia_api WHERE __time > ? LIMIT 1\",\n", + " \"context\": {\n", + " \"sqlQueryId\" : \"important-query\" \n", + " },\n", + " \"parameters\": [\n", + " { \"type\": \"TIMESTAMP\", \"value\": \"2016-06-27\"}\n", + " ]\n", + "})\n", + "headers = {'Content-Type': 'application/json'}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "print(json.dumps(json.loads(response.text), indent=4))" + ] + }, + { + "cell_type": "markdown", + "id": "8fbfa1fa-2cde-46d5-8107-60bd436fb64e", + "metadata": { + "tags": [] + }, + "source": [ + "## Next steps\n", + "\n", + "This tutorial covers some of the basics related to the Druid API. 
To learn more about the kinds of things you can do, see the API documentation:\n", + "\n", + "- [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html)\n", + "- [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html)\n", + "\n", + "You can also try out the [druid-client](https://github.com/paul-rogers/druid-client), a Python library for Druid created by a Druid contributor.\n", + "\n", + "\n", + "\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.8" + }, + "vscode": { + "interpreter": { + "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e" + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/website/sidebars.json b/website/sidebars.json index a398f9fd3642..393390fa6b2a 100644 --- a/website/sidebars.json +++ b/website/sidebars.json @@ -22,7 +22,8 @@ "tutorials/tutorial-transform-spec", "tutorials/docker", "tutorials/tutorial-kerberos-hadoop", - "tutorials/tutorial-msq-convert-spec" + "tutorials/tutorial-msq-convert-spec", + "tutorials/tutorial-jupyter-index" ], "Design": [ "design/architecture", From 133adc6f25b737048613508150c718f6039b96f0 Mon Sep 17 00:00:00 2001 From: "brian.le" Date: Mon, 28 Nov 2022 12:47:27 -0800 Subject: [PATCH 2/8] update prereqs and fix jupyterlab name --- docs/tutorials/tutorial-jupyter-index.md | 28 +++++++--- .../quickstart/jupyter-notebooks/README.md | 51 +++++++++++++++---- website/.spelling | 1 + 3 files changed, 62 insertions(+), 18 deletions(-) diff --git a/docs/tutorials/tutorial-jupyter-index.md b/docs/tutorials/tutorial-jupyter-index.md index 3d7fae739001..55ea5011eee6 100644 --- 
a/docs/tutorials/tutorial-jupyter-index.md +++ b/docs/tutorials/tutorial-jupyter-index.md @@ -24,13 +24,13 @@ title: "Jupyter Notebook tutorials" -You can try out the Druid APIs using the Jupyter Notebook-based tutorials. These tutorials provide snippets of Python code that you can use to run calls against the Druid API. +You can try out the Druid APIs using the Jupyter Notebook-based tutorials. These tutorials provide snippets of Python code that you can use to run calls against the Druid API to complete the tutorial. -## Before you start +## Prerequisites Make sure you meet the following requirements before starting the Jupyter-based tutorials: -- Python3 +- Python 3 - The `requests` package for Python. For example, you can install it with the following command: @@ -38,11 +38,23 @@ Make sure you meet the following requirements before starting the Jupyter-based pip3 install requests ```` -- Jupyter Lab (recommended) or Jupyter Notebook running on a non-default port. By default, Druid and Jupyter both try to use port `8888,` so start Jupyter on a different port. For example, use the following command to start Jupyter Lab on port `3001`: - - ```bash - jupyter lab --port 3001 - ``` +- JupyterLab (recommended) or Jupyter Notebook running on a non-default port. By default, Druid and Jupyter both try to use port `8888`, so start Jupyter on a different port. + + - Install JupyterLab or Notebook: + + ```bash + # Install JupyterLab + pip3 install jupyterlab + # Install Jupyter Notebook + pip3 install notebook + ``` + - Start JupyterLab or Notebook: + ```bash + # Start JupyterLab on port 3001 + jupyter lab --port 3001 + # Start Jupyter Notebook on port 3001 + jupyter notebook --port 3001 + ``` - An available Druid instance. You can use the `micro-quickstart` configuration described in [Quickstart (local)](./index.md). The tutorials assume that you are using the quickstart, so no authentication or authorization is expected unless explicitly mentioned. 
diff --git a/examples/quickstart/jupyter-notebooks/README.md b/examples/quickstart/jupyter-notebooks/README.md index 7c6e33d3d821..36788e10a4bd 100644 --- a/examples/quickstart/jupyter-notebooks/README.md +++ b/examples/quickstart/jupyter-notebooks/README.md @@ -1,14 +1,33 @@ -# Jupyter notebook tutorials for Druid +# Jupyter Notebook tutorials for Druid -You can try out the Druid APIs using the Jupyter Notebook-based tutorials. These tutorials provide snippets of Python code that you can use to run calls against the Druid API. - -## Before you start + + +You can try out the Druid APIs using the Jupyter Notebook-based tutorials. These tutorials provide snippets of Python code that you can use to run calls against the Druid API to complete the tutorial. + +## Prerequisites Make sure you meet the following requirements before starting the Jupyter-based tutorials: -- Python3 +- Python 3 - The `requests` package for Python. For example, you can install it with the following command: @@ -16,11 +35,23 @@ Make sure you meet the following requirements before starting the Jupyter-based pip3 install requests ```` -- Jupyter Lab (recommended) or Jupyter Notebook running on a non-default port. By default, Druid and Jupyter both try to use port `8888,` so start Jupyter on a different port. For example, use the following command to start Jupyter Lab on port `3001`: - - ```bash - jupyter lab --port 3001 - ``` +- JupyterLab (recommended) or Jupyter Notebook running on a non-default port. By default, Druid and Jupyter both try to use port `8888`, so start Jupyter on a different port. + + - Install JupyterLab or Notebook: + + ```bash + # Install JupyterLab + pip3 install jupyterlab + # Install Jupyter Notebook + pip3 install notebook + ``` + - Start JupyterLab or Notebook: + ```bash + # Start JupyterLab on port 3001 + jupyter lab --port 3001 + # Start Jupyter Notebook on port 3001 + jupyter notebook --port 3001 + ``` - An available Druid instance. 
You can use the `micro-quickstart` configuration described in [Quickstart (local)](../../../docs/tutorials/index.md). The tutorials assume that you are using the quickstart, so no authentication or authorization is expected unless explicitly mentioned. diff --git a/website/.spelling b/website/.spelling index fc177de24f8d..d31383b53d1a 100644 --- a/website/.spelling +++ b/website/.spelling @@ -137,6 +137,7 @@ JVMs Joda JsonProperty Jupyter +JupyterLab KMS Kerberized Kerberos From 9a49a4ad56943bbb6cf4cea7a7388162ecf82c15 Mon Sep 17 00:00:00 2001 From: 317brian <53799971+317brian@users.noreply.github.com> Date: Wed, 30 Nov 2022 09:47:29 -0800 Subject: [PATCH 3/8] Removing notebook since 13345 has it 13345 should be merged first --- .../jupyter-notebooks/api-tutorial.ipynb | 442 ------------------ 1 file changed, 442 deletions(-) delete mode 100644 examples/quickstart/jupyter-notebooks/api-tutorial.ipynb diff --git a/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb deleted file mode 100644 index 2006dc20ede5..000000000000 --- a/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb +++ /dev/null @@ -1,442 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "ad4e60b6", - "metadata": { - "tags": [] - }, - "source": [ - "# Tutorial: Learn the basics of the Druid API\n", - "\n", - "\n", - " \n", - "This tutorial introduces you to the basics of the Druid API and some of the endpoints you might use frequently, including the following tasks:\n", - "\n", - "- Checking if your cluster is up\n", - "- Ingesting data\n", - "- Querying data\n", - "- Deleting data\n", - "\n", - "In a Druid deployment, you have [Master, Query, and Data servers](https://druid.apache.org/docs/latest/design/processes.html#server-types) that all fulfill different purposes. 
The endpoint you use for a certain action is determined in part by which server governs that part of Druid and the processes that run on that server type. That's why the [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html#historical) is organized by server type and process.\n", - "\n", - "## Table of contents\n", - "\n", - "- [Requirements](#Requirements)\n", - "- [Get basic cluster information](#Get-basic-cluster-information)\n", - "- [Ingest data](#Ingest-data)\n", - "- [Query your data](#Query-your-data)\n", - "- [Manage your data](#Manage-your-data)\n", - "- [Next steps](#Next-steps)\n", - "\n", - "For the best experience, use Jupyter Lab so that you can always access the table of contents." - ] - }, - { - "cell_type": "markdown", - "id": "8d6bbbcb", - "metadata": { - "tags": [] - }, - "source": [ - "## Requirements\n", - "\n", - "You'll need to install the Requests library for Python before you start. For example:\n", - "\n", - "```bash\n", - "pip3 install requests\n", - "```\n", - "\n", - "Next, you'll need a Druid cluster. This tutorial uses the `micro-quickstart` config described in the [Druid quickstart](https://druid.apache.org/docs/latest/tutorials/index.html). Download Druid and start it if you haven't already. In the root of the Druid folder, run the following command to start Druid:\n", - "\n", - "```bash\n", - "./bin/start-micro-quickstart\n", - "```\n", - "\n", - "Finally, you'll need either Jupyter lab (recommended) or Jupyter notebook. Both the quickstart Druid cluster and Jupyter notebook are deployed at `localhost:8888` by default, so you'll \n", - "need to change the port for Jupyter. To do so, stop Jupyter and start it again with the `port` parameter included. 
For example, you can use the following command to start Jupyter on port `3001`:\n", - "\n", - "```bash\n", - "# If you're using Jupyter lab\n", - "jupyter lab --port 3001\n", - "# If you're using Jupyter notebook\n", - "jupyter notebook --port 3001 \n", - "```\n", - "\n", - "To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the Druid host the tutorial uses. The quickstart deployment configures Druid to listen on port `8888` by default, so you'll be making API calls against `http://localhost:8888`. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b7f08a52", - "metadata": {}, - "outputs": [], - "source": [ - "import requests\n", - "import json\n", - "\n", - "# druid_host is the hostname and port for your Druid deployment. \n", - "druid_host = \"http://localhost:8888\"\n", - "dataSourceName = \"wikipedia_api\"\n", - "print(druid_host)" - ] - }, - { - "cell_type": "markdown", - "id": "2093ecf0-fb4b-405b-a216-094583580e0a", - "metadata": {}, - "source": [ - "In the rest of this tutorial, the `endpoint`, `http_method`, and `payload` variables are updated in code cells to call a different Druid endpoint to accomplish a task." - ] - }, - { - "cell_type": "markdown", - "id": "29c24856", - "metadata": { - "tags": [] - }, - "source": [ - "## Get basic cluster information\n", - "\n", - "In this cell, you'll use the `GET /status` endpoint to return basic information about your cluster, such as the Druid version, loaded extensions, and resource consumption.\n", - "\n", - "The following cell sets `endpoint` to `/status` and updates the HTTP method to `GET`. When you run the cell, you should get a response that starts with the version number of your Druid deployment." 
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "baa140b8",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "endpoint = \"/status\"\n",
-    "print(druid_host+endpoint)\n",
-    "http_method = \"GET\"\n",
-    "\n",
-    "payload = {}\n",
-    "headers = {}\n",
-    "\n",
-    "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n",
-    "print(json.dumps(json.loads(response.text), indent=4))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "cbeb5a63",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "### Get cluster health\n",
-    "\n",
-    "The `/status/health` endpoint returns `true` if your cluster is up and running. It's useful if you want to do something like programmatically check if your cluster is available. When you run the following cell, you should get `true` if your Druid cluster has finished starting up and is running."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "5e51170e",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# GET \n",
-    "endpoint = \"/status/health\"\n",
-    "print(druid_host+endpoint)\n",
-    "http_method = \"GET\"\n",
-    "\n",
-    "payload = {}\n",
-    "headers = {}\n",
-    "\n",
-    "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n",
-    "\n",
-    "print(response.text)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "1de51db8-4c51-4b7e-bb3b-734ff15c8ab3",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "## Ingest data\n",
-    "\n",
-    "Now that you've confirmed that your cluster is up and running, you can start ingesting data. There are different ways to ingest data based on what your needs are. For more information, see [Ingestion methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods).\n",
-    "\n",
-    "This tutorial uses the multi-stage query (MSQ) task engine and its `sql/task` endpoint to perform SQL-based ingestion. The `/sql/task` endpoint accepts [SQL requests in the JSON-over-HTTP format](https://druid.apache.org/docs/latest/querying/sql-api.html#request-body) using the query, context, and parameters fields\n",
-    "\n",
-    "To learn more about SQL-based ingestion, see [SQL-based ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html). For information about the endpoint specifically, see [SQL-based ingestion and multi-stage query task API](https://druid.apache.org/docs/latest/multi-stage-query/api.html).\n",
-    "\n",
-    "\n",
-    "The next cell does the following:\n",
-    "\n",
-    "- Includes a payload that inserts data from an external source into a table named wikipedia_api. The payload is in JSON format and included in the code directly. You can also store it in a file and provide the file. \n",
-    "- Saves the response to a unique variable that you can reference later to identify this ingestion task\n",
-    "\n",
-    "The example uses INSERT, but you could also use REPLACE. \n",
-    "\n",
-    "For the MSQ task engine, ingesting data is done through a task, so the response includes a `taskId` and `state` for your ingestion. You can use this `taskId` to reference this task later on to get more information about it."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "362b6a87",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "endpoint = \"/druid/v2/sql/task\"\n",
-    "print(druid_host+endpoint)\n",
-    "http_method = \"POST\"\n",
-    "\n",
-    "\n",
-    "payload = json.dumps({\n",
-    "\"query\": \"INSERT INTO wikipedia_api SELECT TIME_PARSE(\\\"timestamp\\\") AS __time, * FROM TABLE( EXTERN( '{\\\"type\\\": \\\"http\\\", \\\"uris\\\": [\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\": \\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"user\\\", \\\"type\\\": \\\"string\\\"}]' ) ) PARTITIONED BY DAY\",\n",
-    " \"context\": {\n",
-    " \"maxNumTasks\": 3\n",
-    " }\n",
-    "})\n",
-    "\n",
-    "headers = {'Content-Type': 'application/json'}\n",
-    "\n",
-    "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n",
-    "ingestiion_taskId_response = response\n",
-    "print(response.text + f\"\\nInserting data into the table named {dataSourceName}.\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "c1235e99-be72-40b0-b7f9-9e860e4932d7",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "Extract the `taskId` value from the `taskId_response` variable so that you can reference it later:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "f578b9b2",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "ingestion_taskId = json.loads(ingestiion_taskId_response.text)['taskId']\n",
-    "print(ingestion_taskId)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "f17892d9-a8c1-43d6-890c-7d68cd792c72",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "### Get the status of your task\n",
-    "\n",
-    "The following cell shows you how to get the status of your ingestion. You can see basic information about your query, such as when it started and whether or not it's finished.\n",
-    "\n",
-    "In addition to the status, you can retrieve a full report about it if you want using `GET /druid/indexer/v1/task/TASK_ID/reports`. But you won't need that information for this tutorial."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "fdbab6ae",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "endpoint = f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n",
-    "print(druid_host+endpoint)\n",
-    "http_method = \"GET\"\n",
-    "\n",
-    "payload = {}\n",
-    "headers = {}\n",
-    "\n",
-    "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n",
-    "print(json.dumps(json.loads(response.text), indent=4))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "3b55af57-9c79-4e45-a22c-438c1b94112e",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "## Query your data\n",
-    "\n",
-    "When you ingest data into Druid, Druid stores the data in a datasource, and this datasource is what you run queries against.\n",
-    "\n",
-    "### List your datasources\n",
-    "\n",
-    "You can get a list of datasources from the `/druid/coordinator/v1/datasources` endpoint. Since you're just getting started, there should only be a single datasource, the `wikipedia_api` table you created earlier when you ingested external data."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "959e3c9b",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "endpoint = \"/druid/coordinator/v1/datasources\"\n",
-    "print(druid_host+endpoint)\n",
-    "http_method = \"GET\"\n",
-    "\n",
-    "payload = {}\n",
-    "headers = {}\n",
-    "\n",
-    "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n",
-    "print(response.text)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "622f2158-75c9-4b12-bd8a-c92d30994c1f",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "### Query your data\n",
-    "\n",
-    "Now, you can query the data. Because this tutorial is running in Jupyter, make sure to limit the size of your query results using `LIMIT`. For example, the following cell selects all columns but limits the results to 3 rows for display purposes.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "694900d0-891f-41bd-9b45-5ae957385244",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "endpoint = \"/druid/v2/sql\"\n",
-    "print(druid_host+endpoint)\n",
-    "http_method = \"POST\"\n",
-    "\n",
-    "payload = json.dumps({\n",
-    " \"query\": \"SELECT * FROM wikipedia_api LIMIT 3\"\n",
-    "})\n",
-    "headers = {'Content-Type': 'application/json'}\n",
-    "\n",
-    "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n",
-    "\n",
-    "print(json.dumps(json.loads(response.text), indent=4))\n",
-    "\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "950b2cc4-9935-497d-a3f5-e89afcc85965",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "In addition to the query, there are a few additional things you can define within the payload. For a full list, see [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html)\n",
-    "\n",
-    "This tutorial uses a context parameter and a dynamic parameter.\n",
-    "\n",
-    "Context parameters can control certain characteristics related to a query, such as configuring a custom timeout. For information, see [Context parameters](https://druid.apache.org/docs/latest/querying/query-context.html). In the example query that follows, the context block assigns a custom `sqlQueryID` to the query. Typically, the `sqlQueryId` is autogenerated. With a custom ID, you can use it to reference the query more easily like when you need to cancel a query.\n",
-    "\n",
-    "\n",
-    "Druid supports dynamic parameters, so you can either define certain parameters within the query explicitly or insert a `?` as a placeholder and define it in a parameters block. In the following cell, the `?` gets bound to the timestmap value of `2016-06-27` at execution time. For more information, see [Dynamic parameters](https://druid.apache.org/docs/latest/querying/sql.html#dynamic-parameters).\n",
-    "\n",
-    "\n",
-    "The following cell selects rows where the `__time` column contains a value greater than the value defined dynamically in `parameters` and sets a custom `sqlQueryId`."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "c3860d64-fba6-43bc-80e2-404f5b3b9baa",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "endpoint = \"/druid/v2/sql\"\n",
-    "print(druid_host+endpoint)\n",
-    "http_method = \"POST\"\n",
-    "\n",
-    "payload = json.dumps({\n",
-    " \"query\": \"SELECT * FROM wikipedia_api WHERE __time > ? LIMIT 1\",\n",
-    " \"context\": {\n",
-    " \"sqlQueryId\" : \"important-query\" \n",
-    " },\n",
-    " \"parameters\": [\n",
-    " { \"type\": \"TIMESTAMP\", \"value\": \"2016-06-27\"}\n",
-    " ]\n",
-    "})\n",
-    "headers = {'Content-Type': 'application/json'}\n",
-    "\n",
-    "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n",
-    "print(json.dumps(json.loads(response.text), indent=4))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "8fbfa1fa-2cde-46d5-8107-60bd436fb64e",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "## Next steps\n",
-    "\n",
-    "This tutorial covers the some of the basics related to the Druid API.
To learn more about the kinds of things you can do, see the API documentation:\n",
-    "\n",
-    "- [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html)\n",
-    "- [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html)\n",
-    "\n",
-    "You can also try out the [druid-client](https://github.com/paul-rogers/druid-client), a Python library for Druid created by a Druid contributor.\n",
-    "\n",
-    "\n",
-    "\n"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.10.8"
-  },
-  "vscode": {
-   "interpreter": {
-    "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
-   }
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
From 66eca3bb73b1f96cdd66e334b7167eeec8b5ff95 Mon Sep 17 00:00:00 2001
From: "brian.le"
Date: Wed, 30 Nov 2022 10:14:08 -0800
Subject: [PATCH 4/8] update contributing instructions

---
 examples/quickstart/jupyter-notebooks/README.md | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/examples/quickstart/jupyter-notebooks/README.md b/examples/quickstart/jupyter-notebooks/README.md
index 36788e10a4bd..625ee005fda5 100644
--- a/examples/quickstart/jupyter-notebooks/README.md
+++ b/examples/quickstart/jupyter-notebooks/README.md
@@ -65,13 +65,21 @@ The links that follow are the raw GitHub URLs, so you can use them to download t
 
 ## Contributing
 
-If you build a Jupyter tutorial, you need to do a few things to add it to the docs in addition to saving the notebook in this directory:
+If you build a Jupyter tutorial, you need to do a few things to add it to the docs in addition to saving the notebook in this directory. The process requires two PRs to the repo.
 
-- Clear the outputs from your notebook before you make the PR. You can use the following command:
+For the first PR, do the following:
+
+1. Clear the outputs from your notebook before you make the PR. You can use the following command:
 
    ```bash
    jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace ./path/to/notebook/notebookName.ipynb
    ```
 
-- Update the list of [Tutorials](#tutorials) on this page and in the [ Jupyter tutorial index page](../../../docs/tutorials/tutorial-jupyter-index.md#tutorials) in the `docs/tutorials` directory. When updating `tutorial-jupyter-index.md`, make sure you provide the URL to the raw version of the file. Since you need to specify a branch, the URL will 404 until your PR is merged and the file exists on master.
+2. Create the PR as you normally would. Make sure to note that this PR is the one that contains only the Jupyter notebook and that there will be a subsequent PR that updates related pages.
+
+3. After this first PR is merged, grab the "raw" URL for the file from GitHub. For example, navigate to the file in the GitHub web UI and select **Raw**. Use the URL for this in the second PR as the download link.
+
+For the second PR, do the following:
+1. Update the list of [Tutorials](#tutorials) on this page and in the [ Jupyter tutorial index page](../../../docs/tutorials/tutorial-jupyter-index.md#tutorials) in the `docs/tutorials` directory.
+2. Updating `tutorial-jupyter-index.md` and provide the URL to the raw version of the file that becomes available after the first PR is merged.
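
Reviewer note on step 1 of the first PR: the `nbconvert` command in the patch is the documented route. For contributors who want to see what "clearing outputs" amounts to, here is a minimal Python sketch. It assumes the nbformat 4 layout visible in the notebook diff above, where each code cell carries `outputs` and `execution_count` keys. The function names `clear_outputs` and `clear_notebook_file` are hypothetical and are not part of Jupyter or of this PR.

```python
import json
from pathlib import Path


def clear_outputs(notebook: dict) -> dict:
    """Reset outputs and execution counts on every code cell (nbformat 4 layout assumed)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_notebook_file(path: str) -> None:
    """Rewrite the .ipynb file at `path` in place with code-cell outputs cleared."""
    nb_path = Path(path)
    notebook = json.loads(nb_path.read_text(encoding="utf-8"))
    cleared = clear_outputs(notebook)
    # indent=1 mirrors the indentation Jupyter uses when it saves notebooks.
    nb_path.write_text(
        json.dumps(cleared, indent=1, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
```

This only approximates the effect of `--ClearOutputPreprocessor.enabled=True`; it does not touch cell metadata, so the `nbconvert` command remains the safer choice for an actual PR.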
From cc6dd6b3abd2860bfdaf61a5225eaeeb529bb4e0 Mon Sep 17 00:00:00 2001
From: "brian.le"
Date: Thu, 15 Dec 2022 13:30:46 -0800
Subject: [PATCH 5/8] fix download link

---
 docs/tutorials/tutorial-jupyter-index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/tutorials/tutorial-jupyter-index.md b/docs/tutorials/tutorial-jupyter-index.md
index 55ea5011eee6..16c5bd416268 100644
--- a/docs/tutorials/tutorial-jupyter-index.md
+++ b/docs/tutorials/tutorial-jupyter-index.md
@@ -64,4 +64,4 @@ The notebooks are located in the [apache/druid repo](https://github.com/apache/d
 
 The links that follow are the raw GitHub URLs, so you can use them to download the notebook directly, such as with `wget`, or manually through your web browser. Note that if you save the file from your web browser, make sure to remove the `.txt` extension.
 
-- [Introduction to the Druid API](https://raw.githubusercontent.com/apache/druid/master/api-tutorial-jupyter-nb/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb) walks you through some of the basics related to the Druid API and several endpoints.
\ No newline at end of file
+- [Introduction to the Druid API](https://raw.githubusercontent.com/apache/druid/master/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb) walks you through some of the basics related to the Druid API and several endpoints.
\ No newline at end of file
From 2da8364c1ad1a0a34f9c3c5389f09b1f3b837c6e Mon Sep 17 00:00:00 2001
From: "brian.le"
Date: Thu, 15 Dec 2022 13:33:24 -0800
Subject: [PATCH 6/8] change to use relative path

---
 examples/quickstart/jupyter-notebooks/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/examples/quickstart/jupyter-notebooks/README.md b/examples/quickstart/jupyter-notebooks/README.md
index 625ee005fda5..fef280c921d5 100644
--- a/examples/quickstart/jupyter-notebooks/README.md
+++ b/examples/quickstart/jupyter-notebooks/README.md
@@ -61,7 +61,7 @@ The notebooks are located in the [apache/druid repo](https://github.com/apache/d
 
 The links that follow are the raw GitHub URLs, so you can use them to download the notebook directly, such as with `wget`, or manually through your web browser. Note that if you save the file from your web browser, make sure to remove the `.txt` extension.
 
-- [Introduction to the Druid API](https://raw.githubusercontent.com/apache/druid/master/api-tutorial-jupyter-nb/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb) walks you through some of the basics related to the Druid API and several endpoints.
+- [Introduction to the Druid API](api-tutorial.ipynb) walks you through some of the basics related to the Druid API and several endpoints.
 
 ## Contributing
 
@@ -82,4 +82,4 @@ For the first PR, do the following:
 
 For the second PR, do the following:
 1. Update the list of [Tutorials](#tutorials) on this page and in the [ Jupyter tutorial index page](../../../docs/tutorials/tutorial-jupyter-index.md#tutorials) in the `docs/tutorials` directory.
-2. Updating `tutorial-jupyter-index.md` and provide the URL to the raw version of the file that becomes available after the first PR is merged.
+2. Update `tutorial-jupyter-index.md` and provide the URL to the raw version of the file that becomes available after the first PR is merged.
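
Reviewer note on the raw download link: patch 5/8 fixed a raw URL that had the feature-branch name wedged in after `master`. A small helper along the following lines could make that kind of slip harder to repeat. This is only a sketch: `raw_notebook_url` is a hypothetical name, and the URL shape is taken from the corrected link in patch 5/8 rather than from any GitHub API.

```python
from urllib.parse import quote


def raw_notebook_url(path, owner="apache", repo="druid", branch="master"):
    """Build a raw.githubusercontent.com URL for a file at `path` in the repo."""
    clean_path = path.strip("/")
    # quote() leaves "/" intact by default and escapes characters such as spaces.
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{quote(clean_path)}"


print(raw_notebook_url("examples/quickstart/jupyter-notebooks/api-tutorial.ipynb"))
```

For the notebook in this series, the printed URL matches the link that patch 5/8 adds to `tutorial-jupyter-index.md`. As the contributing steps say, the URL only resolves once the notebook exists on the target branch.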
From ae7a44024591fbff792a6f969f398794d9526615 Mon Sep 17 00:00:00 2001
From: "brian.le"
Date: Thu, 15 Dec 2022 14:37:57 -0800
Subject: [PATCH 7/8] separate jupyter start cmds

---
 docs/tutorials/tutorial-jupyter-index.md | 16 ++++++++++------
 .../quickstart/jupyter-notebooks/README.md | 18 +++++++++++-------
 2 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/docs/tutorials/tutorial-jupyter-index.md b/docs/tutorials/tutorial-jupyter-index.md
index 16c5bd416268..d23192fb435b 100644
--- a/docs/tutorials/tutorial-jupyter-index.md
+++ b/docs/tutorials/tutorial-jupyter-index.md
@@ -48,12 +48,16 @@ Make sure you meet the following requirements before starting the Jupyter-based
     # Install Jupyter Notebook
     pip3 install notebook
     ```
-  - Start JupyterLab
-    ```bash
-    # Start JupyterLab on port 3001
-    jupyter lab --port 3001
-    # Start Jupyter notebook on port 3001
-    jupyter notebook --port 3001
+  - Start Jupyter
+    - JupyterLab
+      ```bash
+      # Start JupyterLab on port 3001
+      jupyter lab --port 3001
+      ```
+    - Jupyter Notebook
+      ```bash
+      # Start Jupyter Notebook on port 3001
+      jupyter notebook --port 3001
     ```
 
 - An available Druid instance. You can use the `micro-quickstart` configuration described in [Quickstart (local)](./index.md). The tutorials assume that you are using the quickstart, so no authentication or authorization is expected unless explicitly mentioned.
diff --git a/examples/quickstart/jupyter-notebooks/README.md b/examples/quickstart/jupyter-notebooks/README.md
index fef280c921d5..7e5fa2becaee 100644
--- a/examples/quickstart/jupyter-notebooks/README.md
+++ b/examples/quickstart/jupyter-notebooks/README.md
@@ -45,13 +45,17 @@ Make sure you meet the following requirements before starting the Jupyter-based
     # Install Jupyter Notebook
     pip3 install notebook
     ```
-  - Start JupyterLab
-    ```bash
-    # Start JupyterLab on port 3001
-    jupyter lab --port 3001
-    # Start Jupyter notebook on port 3001
-    jupyter notebook --port 3001
-    ```
+  - Start Jupyter:
+    - JupyterLab
+      ```bash
+      # Start JupyterLab on port 3001
+      jupyter lab --port 3001
+      ```
+    - Jupyter Notebook
+      ```bash
+      # Start Jupyter Notebook on port 3001
+      jupyter notebook --port 3001
+      ```
 
 - An available Druid instance. You can use the `micro-quickstart` configuration described in [Quickstart (local)](../../../docs/tutorials/index.md). The tutorials assume that you are using the quickstart, so no authentication or authorization is expected unless explicitly mentioned.
 
From c1f004a7417d84b2b28f2d194363f9ed6fe578c2 Mon Sep 17 00:00:00 2001
From: "brian.le"
Date: Thu, 15 Dec 2022 16:36:01 -0800
Subject: [PATCH 8/8] fix typo

---
 docs/tutorials/tutorial-jupyter-index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/tutorials/tutorial-jupyter-index.md b/docs/tutorials/tutorial-jupyter-index.md
index d23192fb435b..233b9fda50f5 100644
--- a/docs/tutorials/tutorial-jupyter-index.md
+++ b/docs/tutorials/tutorial-jupyter-index.md
@@ -36,7 +36,7 @@ Make sure you meet the following requirements before starting the Jupyter-based
 
   ```bash
   pip3 install requests
-  ````
+  ```
 
 - JupyterLab (recommended) or Jupyter Notebook running on a non-default port. By default, Druid and Jupyter both try to use port `8888,` so start Jupyter on a different port.
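
Reviewer note on the port requirement: the docs say Druid and Jupyter both default to port `8888` and suggest `3001` for Jupyter. Before starting either service, a reader could check whether a port is already taken with a short Python sketch like the one below. The helper name `port_is_free` is hypothetical, and the check assumes both services run on the local machine. It is a point-in-time test only, so another process can still claim the port afterwards.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can bind to host:port at this moment."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use": something else holds the port.
            return False
    return True


for candidate in (8888, 3001):
    state = "free" if port_is_free(candidate) else "in use"
    print(f"port {candidate}: {state}")
```

If `8888` reports as in use while the quickstart is running, that is the expected state; the point is to confirm that the port passed to `jupyter lab --port` is still available.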