From 308ae7ff1072376c8b042315e53a9fdee66a13b7 Mon Sep 17 00:00:00 2001 From: "brian.le" Date: Wed, 9 Nov 2022 16:06:53 -0800 Subject: [PATCH 1/7] docs: notebook for API tutorial --- .gitignore | 1 + .../jupyter-notebooks/api-tutorial.ipynb | 442 ++++++++++++++++++ 2 files changed, 443 insertions(+) create mode 100644 examples/quickstart/jupyter-notebooks/api-tutorial.ipynb diff --git a/.gitignore b/.gitignore index f906e2426793..90ac70d93dd8 100644 --- a/.gitignore +++ b/.gitignore @@ -27,3 +27,4 @@ README integration-tests/gen-scripts/ /bin/ *.hprof +*.ipynb_checkpoints/ \ No newline at end of file diff --git a/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb new file mode 100644 index 000000000000..2006dc20ede5 --- /dev/null +++ b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb @@ -0,0 +1,442 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "ad4e60b6", + "metadata": { + "tags": [] + }, + "source": [ + "# Tutorial: Learn the basics of the Druid API\n", + "\n", + "\n", + " \n", + "This tutorial introduces you to the basics of the Druid API and some of the endpoints you might use frequently, including the following tasks:\n", + "\n", + "- Checking if your cluster is up\n", + "- Ingesting data\n", + "- Querying data\n", + "- Deleting data\n", + "\n", + "In a Druid deployment, you have [Master, Query, and Data servers](https://druid.apache.org/docs/latest/design/processes.html#server-types) that all fulfill different purposes. The endpoint you use for a certain action is determined, partially, by which server governs that part of Druid and the processes that run on that server type. 
That's why the [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html#historical) is organized by server type and process.\n", + "\n", + "## Table of contents\n", + "\n", + "- [Before you start](#Before-you-start)\n", + "- [Get basic cluster information](#Get-basic-cluster-information)\n", + "- [Ingest data](#Ingest-data)\n", + "- [Query your data](#Query-your-data)\n", + "- [Manage your data](#Manage-your-data)\n", + "- [Next steps](#Next-steps)\n", + "\n", + "For the best experience, use Jupyter Lab so that you can always access the table of contents." + ] + }, + { + "cell_type": "markdown", + "id": "8d6bbbcb", + "metadata": { + "tags": [] + }, + "source": [ + "## Requirements\n", + "\n", + "You'll need to install the Requests library for Python before you start. For example:\n", + "\n", + "```bash\n", + "pip3 install requests\n", + "```\n", + "\n", + "Next, you'll need a Druid cluster. This tutorial uses the `micro-quickstart` config described in the [Druid quickstart](https://druid.apache.org/docs/latest/tutorials/index.html). So download that and start it if you haven't already. In the root of the Druid folder, run the following command to start Druid:\n", + "\n", + "```bash\n", + "./bin/start-micro-quickstart\n", + "```\n", + "\n", + "Finally, you'll need either Jupyter lab (recommended) or Jupyter notebook. Both the quickstart Druid cluster and Jupyter notebook are deployed at `localhost:8888` by default, so you'll \n", + "need to change the port for Jupyter. To do so, stop Jupyter and start it again with the `port` parameter included. For example, you can use the following command to start Jupyter on port `3001`:\n", + "\n", + "```bash\n", + "# If you're using Jupyter lab\n", + "jupyter lab --port 3001\n", + "# If you're using Jupyter notebook\n", + "jupyter notebook --port 3001 \n", + "```\n", + "\n", + "To start this tutorial, run the next cell. 
It imports the Python packages you'll need and defines a variable for the Druid host the tutorial uses. The quickstart deployment configures Druid to listen on port `8888` by default, so you'll be making API calls against `http://localhost:8888`. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b7f08a52", + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "import json\n", + "\n", + "# druid_host is the hostname and port for your Druid deployment. \n", + "druid_host = \"http://localhost:8888\"\n", + "dataSourceName = \"wikipedia_api\"\n", + "print(druid_host)" + ] + }, + { + "cell_type": "markdown", + "id": "2093ecf0-fb4b-405b-a216-094583580e0a", + "metadata": {}, + "source": [ + "In the rest of this tutorial, the `endpoint`, `http_method`, and `payload` variables are updated in code cells to call a different Druid endpoint to accomplish a task." + ] + }, + { + "cell_type": "markdown", + "id": "29c24856", + "metadata": { + "tags": [] + }, + "source": [ + "## Get basic cluster information\n", + "\n", + "In this cell, you'll use the `GET /status` endpoint to return basic information about your cluster, such as the Druid version, loaded extensions, and resource consumption.\n", + "\n", + "The following cell sets `endpoint` to `/status` and updates the HTTP method to `GET`. When you run the cell, you should get a response that starts with the version number of your Druid deployment." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "baa140b8", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "endpoint = \"/status\"\n", + "print(druid_host+endpoint)\n", + "http_method = \"GET\"\n", + "\n", + "payload = {}\n", + "headers = {}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "print(json.dumps(json.loads(response.text), indent=4))" + ] + }, + { + "cell_type": "markdown", + "id": "cbeb5a63", + "metadata": { + "tags": [] + }, + "source": [ + "### Get cluster health\n", + "\n", + "The `/status/health` endpoint returns `true` if your cluster is up and running. It's useful if you want to do something like programmatically check if your cluster is available. When you run the following cell, you should get `true` if your Druid cluster has finished starting up and is running." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5e51170e", + "metadata": {}, + "outputs": [], + "source": [ + "# GET \n", + "endpoint = \"/status/health\"\n", + "print(druid_host+endpoint)\n", + "http_method = \"GET\"\n", + "\n", + "payload = {}\n", + "headers = {}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "\n", + "print(response.text)" + ] + }, + { + "cell_type": "markdown", + "id": "1de51db8-4c51-4b7e-bb3b-734ff15c8ab3", + "metadata": { + "tags": [] + }, + "source": [ + "## Ingest data\n", + "\n", + "Now that you've confirmed that your cluster is up and running, you can start ingesting data. There are different ways to ingest data based on what your needs are. For more information, see [Ingestion methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods).\n", + "\n", + "This tutorial uses the multi-stage query (MSQ) task engine and its `sql/task` endpoint to perform SQL-based ingestion. 
The `/sql/task` endpoint accepts [SQL requests in the JSON-over-HTTP format](https://druid.apache.org/docs/latest/querying/sql-api.html#request-body) using the query, context, and parameters fields\n", + "\n", + "To learn more about SQL-based ingestion, see [SQL-based ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html). For information about the endpoint specifically, see [SQL-based ingestion and multi-stage query task API](https://druid.apache.org/docs/latest/multi-stage-query/api.html).\n", + "\n", + "\n", + "The next cell does the following:\n", + "\n", + "- Includes a payload that inserts data from an external source into a table named wikipedia_api. The payload is in JSON format and included in the code directly. You can also store it in a file and provide the file. \n", + "- Saves the response to a unique variable that you can reference later to identify this ingestion task\n", + "\n", + "The example uses INSERT, but you could also use REPLACE. \n", + "\n", + "For the MSQ task engine, ingesting data is done through a task, so the response includes a `taskId` and `state` for your ingestion. You can use this `taskId` to reference this task later on to get more information about it." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "362b6a87", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = \"/druid/v2/sql/task\"\n", + "print(druid_host+endpoint)\n", + "http_method = \"POST\"\n", + "\n", + "\n", + "payload = json.dumps({\n", + "\"query\": \"INSERT INTO wikipedia_api SELECT TIME_PARSE(\\\"timestamp\\\") AS __time, * FROM TABLE( EXTERN( '{\\\"type\\\": \\\"http\\\", \\\"uris\\\": [\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\": \\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": 
\\\"string\\\"}, {\\\"name\\\": \\\"user\\\", \\\"type\\\": \\\"string\\\"}]' ) ) PARTITIONED BY DAY\",\n", + " \"context\": {\n", + " \"maxNumTasks\": 3\n", + " }\n", + "})\n", + "\n", + "headers = {'Content-Type': 'application/json'}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "ingestiion_taskId_response = response\n", + "print(response.text + f\"\\nInserting data into the table named {dataSourceName}.\")" + ] + }, + { + "cell_type": "markdown", + "id": "c1235e99-be72-40b0-b7f9-9e860e4932d7", + "metadata": { + "tags": [] + }, + "source": [ + "Extract the `taskId` value from the response you saved in the previous cell so that you can reference it later:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f578b9b2", + "metadata": {}, + "outputs": [], + "source": [ + "ingestion_taskId = json.loads(ingestiion_taskId_response.text)['taskId']\n", + "print(ingestion_taskId)" + ] + }, + { + "cell_type": "markdown", + "id": "f17892d9-a8c1-43d6-890c-7d68cd792c72", + "metadata": { + "tags": [] + }, + "source": [ + "### Get the status of your task\n", + "\n", + "The following cell shows you how to get the status of your ingestion. You can see basic information about your query, such as when it started and whether or not it's finished.\n", + "\n", + "In addition to the status, you can retrieve a full report about it if you want using `GET /druid/indexer/v1/task/TASK_ID/reports`. But you won't need that information for this tutorial." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fdbab6ae", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n", + "print(druid_host+endpoint)\n", + "http_method = \"GET\"\n", + "\n", + "payload = {}\n", + "headers = {}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "print(json.dumps(json.loads(response.text), indent=4))" + ] + }, + { + "cell_type": "markdown", + "id": "3b55af57-9c79-4e45-a22c-438c1b94112e", + "metadata": { + "tags": [] + }, + "source": [ + "## Query your data\n", + "\n", + "When you ingest data into Druid, Druid stores the data in a datasource, and this datasource is what you run queries against.\n", + "\n", + "### List your datasources\n", + "\n", + "You can get a list of datasources from the `/druid/coordinator/v1/datasources` endpoint. Since you're just getting started, there should only be a single datasource, the `wikipedia_api` table you created earlier when you ingested external data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "959e3c9b", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = \"/druid/coordinator/v1/datasources\"\n", + "print(druid_host+endpoint)\n", + "http_method = \"GET\"\n", + "\n", + "payload = {}\n", + "headers = {}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "print(response.text)" + ] + }, + { + "cell_type": "markdown", + "id": "622f2158-75c9-4b12-bd8a-c92d30994c1f", + "metadata": { + "tags": [] + }, + "source": [ + "### Query your data\n", + "\n", + "Now, you can query the data. Because this tutorial is running in Jupyter, make sure to limit the size of your query results using `LIMIT`. 
For example, the following cell selects all columns but limits the results to 3 rows for display purposes.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "694900d0-891f-41bd-9b45-5ae957385244", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = \"/druid/v2/sql\"\n", + "print(druid_host+endpoint)\n", + "http_method = \"POST\"\n", + "\n", + "payload = json.dumps({\n", + " \"query\": \"SELECT * FROM wikipedia_api LIMIT 3\"\n", + "})\n", + "headers = {'Content-Type': 'application/json'}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "\n", + "print(json.dumps(json.loads(response.text), indent=4))\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "950b2cc4-9935-497d-a3f5-e89afcc85965", + "metadata": { + "tags": [] + }, + "source": [ + "In addition to the query, there are a few additional things you can define within the payload. For a full list, see [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html).\n", + "\n", + "This tutorial uses a context parameter and a dynamic parameter.\n", + "\n", + "Context parameters can control certain characteristics related to a query, such as configuring a custom timeout. For information, see [Context parameters](https://druid.apache.org/docs/latest/querying/query-context.html). In the example query that follows, the context block assigns a custom `sqlQueryId` to the query. Typically, the `sqlQueryId` is autogenerated. With a custom ID, you can use it to reference the query more easily like when you need to cancel a query.\n", + "\n", + "\n", + "Druid supports dynamic parameters, so you can either define certain parameters within the query explicitly or insert a `?` as a placeholder and define it in a parameters block. In the following cell, the `?` gets bound to the timestamp value of `2016-06-27` at execution time. 
For more information, see [Dynamic parameters](https://druid.apache.org/docs/latest/querying/sql.html#dynamic-parameters).\n", + "\n", + "\n", + "The following cell selects rows where the `__time` column contains a value greater than the value defined dynamically in `parameters` and sets a custom `sqlQueryId`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c3860d64-fba6-43bc-80e2-404f5b3b9baa", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = \"/druid/v2/sql\"\n", + "print(druid_host+endpoint)\n", + "http_method = \"POST\"\n", + "\n", + "payload = json.dumps({\n", + " \"query\": \"SELECT * FROM wikipedia_api WHERE __time > ? LIMIT 1\",\n", + " \"context\": {\n", + " \"sqlQueryId\" : \"important-query\" \n", + " },\n", + " \"parameters\": [\n", + " { \"type\": \"TIMESTAMP\", \"value\": \"2016-06-27\"}\n", + " ]\n", + "})\n", + "headers = {'Content-Type': 'application/json'}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "print(json.dumps(json.loads(response.text), indent=4))" + ] + }, + { + "cell_type": "markdown", + "id": "8fbfa1fa-2cde-46d5-8107-60bd436fb64e", + "metadata": { + "tags": [] + }, + "source": [ + "## Next steps\n", + "\n", + "This tutorial covers some of the basics related to the Druid API. 
To learn more about the kinds of things you can do, see the API documentation:\n", + "\n", + "- [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html)\n", + "- [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html)\n", + "\n", + "You can also try out the [druid-client](https://github.com/paul-rogers/druid-client), a Python library for Druid created by a Druid contributor.\n", + "\n", + "\n", + "\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.8" + }, + "vscode": { + "interpreter": { + "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e" + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From e4eb60a2e72e57db0952aab4b34c3a0144cf5b5e Mon Sep 17 00:00:00 2001 From: 317brian <53799971+317brian@users.noreply.github.com> Date: Tue, 22 Nov 2022 15:12:25 -0800 Subject: [PATCH 2/7] Apply suggestions from code review Co-authored-by: Charles Smith --- examples/quickstart/jupyter-notebooks/api-tutorial.ipynb | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb index 2006dc20ede5..85dfb7ed0636 100644 --- a/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb +++ b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb @@ -184,7 +184,7 @@ "\n", "Now that you've confirmed that your cluster is up and running, you can start ingesting data. There are different ways to ingest data based on what your needs are. 
For more information, see [Ingestion methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods).\n", "\n", - "This tutorial uses the multi-stage query (MSQ) task engine and its `sql/task` endpoint to perform SQL-based ingestion. The `/sql/task` endpoint accepts [SQL requests in the JSON-over-HTTP format](https://druid.apache.org/docs/latest/querying/sql-api.html#request-body) using the query, context, and parameters fields\n", + "This tutorial uses the multi-stage query (MSQ) task engine and its `sql/task` endpoint to perform SQL-based ingestion. The `/sql/task` endpoint accepts [SQL requests in the JSON-over-HTTP format](https://druid.apache.org/docs/latest/querying/sql-api.html#request-body) using the `query`, `context`, and `parameters` fields\n", "\n", "To learn more about SQL-based ingestion, see [SQL-based ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html). For information about the endpoint specifically, see [SQL-based ingestion and multi-stage query task API](https://druid.apache.org/docs/latest/multi-stage-query/api.html).\n", "\n", @@ -255,7 +255,7 @@ "source": [ "### Get the status of your task\n", "\n", - "The following cell shows you how to get the status of your ingestion. You can see basic information about your query, such as when it started and whether or not it's finished.\n", + "The following cell shows you how to get the status of your ingestion task. You can see basic information about your query, such as when it started and whether or not it's finished.\n", "\n", "In addition to the status, you can retrieve a full report about it if you want using `GET /druid/indexer/v1/task/TASK_ID/reports`. But you won't need that information for this tutorial." ] @@ -357,7 +357,7 @@ "\n", "This tutorial uses a context parameter and a dynamic parameter.\n", "\n", - "Context parameters can control certain characteristics related to a query, such as configuring a custom timeout. 
For information, see [Context parameters](https://druid.apache.org/docs/latest/querying/query-context.html). In the example query that follows, the context block assigns a custom `sqlQueryId` to the query. Typically, the `sqlQueryId` is autogenerated. With a custom ID, you can use it to reference the query more easily like when you need to cancel a query.\n", + "Context parameters can control certain characteristics related to a query, such as configuring a custom timeout. For information, see [Context parameters](https://druid.apache.org/docs/latest/querying/query-context.html). In the example query that follows, the context block assigns a custom `sqlQueryId` to the query. Typically, the `sqlQueryId` is autogenerated. With a custom ID, you can reference the query more easily, for example, when you need to cancel it.\n", "\n", "\n", "Druid supports dynamic parameters, so you can either define certain parameters within the query explicitly or insert a `?` as a placeholder and define it in a parameters block. In the following cell, the `?` gets bound to the timestamp value of `2016-06-27` at execution time. 
For more information, see [Dynamic parameters](https://druid.apache.org/docs/latest/querying/sql.html#dynamic-parameters).\n", From c0e19ef115281a9c9ad684578282011e893bb08e Mon Sep 17 00:00:00 2001 From: "brian.le" Date: Tue, 22 Nov 2022 16:29:40 -0800 Subject: [PATCH 3/7] address the other comments --- .../jupyter-notebooks/api-tutorial.ipynb | 354 ++++++++++++++++-- 1 file changed, 324 insertions(+), 30 deletions(-) diff --git a/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb index 85dfb7ed0636..d6c64c93d81b 100644 --- a/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb +++ b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb @@ -28,18 +28,19 @@ " ~ under the License.\n", " -->\n", " \n", - "This tutorial introduces you to the basics of the Druid API and some of the endpoints you might use frequently, including the following tasks:\n", + "This tutorial introduces you to the basics of the Druid API and some of the endpoints you might use frequently to perform tasks, such as the following:\n", "\n", "- Checking if your cluster is up\n", "- Ingesting data\n", "- Querying data\n", - "- Deleting data\n", "\n", - "In a Druid deployment, you have [Master, Query, and Data servers](https://druid.apache.org/docs/latest/design/processes.html#server-types) that all fulfill different purposes. The endpoint you use for a certain action is determined, partially, by which server governs that part of Druid and the processes that run on that server type. That's why the [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html#historical) is organized by server type and process.\n", + "Different [Druid server types](https://druid.apache.org/docs/latest/design/processes.html#server-types) are responsible for handling different APIs for the Druid services. For example, make API calls to the Overlord service on the Master server to get the status of a task. 
You'll also interact with the Broker service on the Query server to see what datasources are available. And to run queries, you'll interact with the Router on the Query server.\n", + "\n", + "For more information, see the [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html), which is organized by server type.\n", "\n", "## Table of contents\n", "\n", - "- [Before you start](#Before-you-start)\n", + "- [Prerequisites](#Prerequisites)\n", "- [Get basic cluster information](#Get-basic-cluster-information)\n", "- [Ingest data](#Ingest-data)\n", "- [Query your data](#Query-your-data)\n", @@ -56,7 +57,7 @@ "tags": [] }, "source": [ - "## Requirements\n", + "## Prerequisites\n", "\n", "You'll need to install the Requests library for Python before you start. For example:\n", "\n", @@ -80,15 +81,23 @@ "jupyter notebook --port 3001 \n", "```\n", "\n", - "To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the Druid host the tutorial uses. The quickstart deployment configures Druid to listen on port `8888` by default, so you'll be making API calls against `http://localhost:8888`. " + "To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the Druid host the tutorial uses. The quickstart deployment configures Druid's Router service to listen on port `8888` by default, so you'll be making API calls against `http://localhost:8888`. The Router directs most API calls to the appropriate service." 
] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "id": "b7f08a52", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "http://localhost:8888\n" + ] + } + ], "source": [ "import requests\n", "import json\n", @@ -123,12 +132,117 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "id": "baa140b8", "metadata": { "tags": [] }, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "http://localhost:8888/status\n", + "{\n", + " \"version\": \"24.0.0\",\n", + " \"modules\": [\n", + " {\n", + " \"name\": \"org.apache.druid.common.gcp.GcpModule\",\n", + " \"artifact\": \"druid-gcp-common\",\n", + " \"version\": \"24.0.0\"\n", + " },\n", + " {\n", + " \"name\": \"org.apache.druid.common.aws.AWSModule\",\n", + " \"artifact\": \"druid-aws-common\",\n", + " \"version\": \"24.0.0\"\n", + " },\n", + " {\n", + " \"name\": \"org.apache.druid.storage.hdfs.HdfsStorageDruidModule\",\n", + " \"artifact\": \"druid-hdfs-storage\",\n", + " \"version\": \"24.0.0\"\n", + " },\n", + " {\n", + " \"name\": \"org.apache.druid.indexing.kafka.KafkaIndexTaskModule\",\n", + " \"artifact\": \"druid-kafka-indexing-service\",\n", + " \"version\": \"24.0.0\"\n", + " },\n", + " {\n", + " \"name\": \"org.apache.druid.query.aggregation.datasketches.theta.SketchModule\",\n", + " \"artifact\": \"druid-datasketches\",\n", + " \"version\": \"24.0.0\"\n", + " },\n", + " {\n", + " \"name\": \"org.apache.druid.query.aggregation.datasketches.theta.oldapi.OldApiSketchModule\",\n", + " \"artifact\": \"druid-datasketches\",\n", + " \"version\": \"24.0.0\"\n", + " },\n", + " {\n", + " \"name\": \"org.apache.druid.query.aggregation.datasketches.quantiles.DoublesSketchModule\",\n", + " \"artifact\": \"druid-datasketches\",\n", + " \"version\": \"24.0.0\"\n", + " },\n", + " {\n", + " \"name\": 
\"org.apache.druid.query.aggregation.datasketches.tuple.ArrayOfDoublesSketchModule\",\n", + " \"artifact\": \"druid-datasketches\",\n", + " \"version\": \"24.0.0\"\n", + " },\n", + " {\n", + " \"name\": \"org.apache.druid.query.aggregation.datasketches.hll.HllSketchModule\",\n", + " \"artifact\": \"druid-datasketches\",\n", + " \"version\": \"24.0.0\"\n", + " },\n", + " {\n", + " \"name\": \"org.apache.druid.msq.guice.MSQExternalDataSourceModule\",\n", + " \"artifact\": \"druid-multi-stage-query\",\n", + " \"version\": \"24.0.0\"\n", + " },\n", + " {\n", + " \"name\": \"org.apache.druid.msq.guice.MSQIndexingModule\",\n", + " \"artifact\": \"druid-multi-stage-query\",\n", + " \"version\": \"24.0.0\"\n", + " },\n", + " {\n", + " \"name\": \"org.apache.druid.msq.guice.MSQDurableStorageModule\",\n", + " \"artifact\": \"druid-multi-stage-query\",\n", + " \"version\": \"24.0.0\"\n", + " },\n", + " {\n", + " \"name\": \"org.apache.druid.msq.guice.MSQServiceClientModule\",\n", + " \"artifact\": \"druid-multi-stage-query\",\n", + " \"version\": \"24.0.0\"\n", + " },\n", + " {\n", + " \"name\": \"org.apache.druid.msq.guice.MSQSqlModule\",\n", + " \"artifact\": \"druid-multi-stage-query\",\n", + " \"version\": \"24.0.0\"\n", + " },\n", + " {\n", + " \"name\": \"org.apache.druid.msq.guice.SqlTaskModule\",\n", + " \"artifact\": \"druid-multi-stage-query\",\n", + " \"version\": \"24.0.0\"\n", + " },\n", + " {\n", + " \"name\": \"org.apache.druid.data.input.protobuf.ProtobufExtensionsModule\",\n", + " \"artifact\": \"druid-protobuf-extensions\",\n", + " \"version\": \"24.0.0\"\n", + " },\n", + " {\n", + " \"name\": \"org.apache.druid.data.input.avro.AvroExtensionsModule\",\n", + " \"artifact\": \"druid-avro-extensions\",\n", + " \"version\": \"24.0.0\"\n", + " }\n", + " ],\n", + " \"memory\": {\n", + " \"maxMemory\": 134217728,\n", + " \"totalMemory\": 134217728,\n", + " \"freeMemory\": 88157184,\n", + " \"usedMemory\": 46060544,\n", + " \"directMemory\": 134217728\n", + " }\n", 
+ "}\n" + ] + } + ], "source": [ "endpoint = \"/status\"\n", "print(druid_host+endpoint)\n", @@ -138,7 +252,7 @@ "headers = {}\n", "\n", "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", - "print(json.dumps(json.loads(response.text), indent=4))" + "print(json.dumps(response.json(), indent=4))" ] }, { @@ -196,15 +310,25 @@ "\n", "The example uses INSERT, but you could also use REPLACE. \n", "\n", - "For the MSQ task engine, ingesting data is done through a task, so the response includes a `taskId` and `state` for your ingestion. You can use this `taskId` to reference this task later on to get more information about it." + "The MSQ task engine uses a task to ingest data. The response for the API includes a `taskId` and `state` for your ingestion. You can use this `taskId` to reference this task later on to get more information about it." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 79, "id": "362b6a87", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "http://localhost:8888/druid/v2/sql/task\n", + "{\"taskId\":\"query-e6ee8e33-9d9a-4b8d-b54e-54978be36b2c\",\"state\":\"RUNNING\"}\n", + "Inserting data into the table named wikipedia_api.\n" + ] + } + ], "source": [ "endpoint = \"/druid/v2/sql/task\"\n", "print(druid_host+endpoint)\n", @@ -212,7 +336,10 @@ "\n", "\n", "payload = json.dumps({\n", - "\"query\": \"INSERT INTO wikipedia_api SELECT TIME_PARSE(\\\"timestamp\\\") AS __time, * FROM TABLE( EXTERN( '{\\\"type\\\": \\\"http\\\", \\\"uris\\\": [\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\": \\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"commentLength\\\", 
\\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"user\\\", \\\"type\\\": \\\"string\\\"}]' ) ) PARTITIONED BY DAY\",\n", + "\"query\": \"INSERT INTO wikipedia_api SELECT TIME_PARSE(\\\"timestamp\\\") \\\n", + " AS __time, * FROM TABLE \\\n", + " (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\": [\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\": \\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"countryName\\\", \\\"type\\\": 
\\\"string\\\"}, {\\\"name\\\": \\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"user\\\", \\\"type\\\": \\\"string\\\"}]')) \\\n", + " PARTITIONED BY DAY\",\n", " \"context\": {\n", " \"maxNumTasks\": 3\n", " }\n", @@ -237,10 +364,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 71, "id": "f578b9b2", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "query-3e7b7c8f-0014-425d-a7a6-495f6a876819\n" + ] + } + ], "source": [ "ingestion_taskId = json.loads(ingestiion_taskId_response.text)['taskId']\n", "print(ingestion_taskId)" @@ -255,27 +390,86 @@ "source": [ "### Get the status of your task\n", "\n", - "The following cell shows you how to get the status of your ingestion task. You can see basic information about your query, such as when it started and whether or not it's finished.\n", + "The following cell shows you how to get the status of your ingestion task. 
The example continues to run API calls against the endpoint to fetch the status until the ingestion task completes. When it's done, you'll see the JSON response.\n", + "\n", + "You can see basic information about your query, such as when it started and whether or not it's finished.\n", "\n", "In addition to the status, you can retrieve a full report about it if you want using `GET /druid/indexer/v1/task/TASK_ID/reports`. But you won't need that information for this tutorial." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 76, "id": "fdbab6ae", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "http://localhost:8888/druid/indexer/v1/task/query-3e7b7c8f-0014-425d-a7a6-495f6a876819/status\n", + "The ingestion is complete:\n", + "{\n", + " \"task\": \"query-3e7b7c8f-0014-425d-a7a6-495f6a876819\",\n", + " \"status\": {\n", + " \"id\": \"query-3e7b7c8f-0014-425d-a7a6-495f6a876819\",\n", + " \"groupId\": \"query-3e7b7c8f-0014-425d-a7a6-495f6a876819\",\n", + " \"type\": \"query_controller\",\n", + " \"createdTime\": \"2022-11-23T00:10:00.529Z\",\n", + " \"queueInsertionTime\": \"1970-01-01T00:00:00.000Z\",\n", + " \"statusCode\": \"SUCCESS\",\n", + " \"status\": \"SUCCESS\",\n", + " \"runnerStatusCode\": \"WAITING\",\n", + " \"duration\": 97332,\n", + " \"location\": {\n", + " \"host\": \"localhost\",\n", + " \"port\": 8100,\n", + " \"tlsPort\": -1\n", + " },\n", + " \"dataSource\": \"wikipedia_api3\",\n", + " \"errorMsg\": null\n", + " }\n", + "}\n" + ] + } + ], "source": [ + "import time\n", + "\n", "endpoint = f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n", "print(druid_host+endpoint)\n", "http_method = \"GET\"\n", "\n", + "\n", "payload = {}\n", "headers = {}\n", "\n", "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", - "print(json.dumps(json.loads(response.text), indent=4))" + "ingestion_status = 
json.loads(response.text)['status']['status']\n", + "# If you only want to fetch the status only once and print it, \n", + "# uncomment the print statement and comment out the if and while loops\n", + "# print(json.dumps(response.json(), indent=4))\n", + "\n", + "\n", + "if ingestion_status == \"RUNNING\":\n", + " print(\"The ingestion is running...\")\n", + "\n", + "while ingestion_status == \"RUNNING\":\n", + " response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + " ingestion_status = json.loads(response.text)['status']['status']\n", + " time.sleep(15) \n", + " \n", + "if ingestion_status == \"SUCCESS\": \n", + " print(\"The ingestion is complete:\")\n", + " print(json.dumps(response.json(), indent=4))\n" + ] + }, + { + "cell_type": "markdown", + "id": "1336ddd5-8a42-41af-8913-533316221c52", + "metadata": {}, + "source": [ + "Wait until your ingestion completes before proceeding. Depending on what else is happening in your Druid cluster and the resources available, ingestion can take some time." ] }, { @@ -296,10 +490,19 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 41, "id": "959e3c9b", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "http://localhost:8888/druid/coordinator/v1/datasources\n", + "[\"transactions\",\"wikipedia\",\"wikipedia-kafka\",\"wikipedia_api\"]\n" + ] + } + ], "source": [ "endpoint = \"/druid/coordinator/v1/datasources\"\n", "print(druid_host+endpoint)\n", @@ -321,15 +524,106 @@ "source": [ "### Query your data\n", "\n", - "Now, you can query the data. Because this tutorial is running in Jupyter, make sure to limit the size of your query results using `LIMIT`. For example, the following cell selects all columns but limits the results to 3 rows for display purposes.\n" + "Now, you can query the data. 
Because this tutorial is running in Jupyter, make sure to limit the size of your query results using `LIMIT`. For example, the following cell selects all columns but limits the results to 3 rows for display purposes because each row is a JSON object. In actual use cases, you'll want to only select the rows that you need. For more information about the kinds of things you can do, see [Druid SQL](https://druid.apache.org/docs/latest/querying/sql.html).\n" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 42, "id": "694900d0-891f-41bd-9b45-5ae957385244", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "http://localhost:8888/druid/v2/sql\n", + "[\n", + " {\n", + " \"__time\": \"2016-06-27T00:00:11.080Z\",\n", + " \"added\": 31,\n", + " \"channel\": \"#sv.wikipedia\",\n", + " \"cityName\": \"\",\n", + " \"comment\": \"Botskapande Indonesien omdirigering\",\n", + " \"commentLength\": 35,\n", + " \"countryIsoCode\": \"\",\n", + " \"countryName\": \"\",\n", + " \"deleted\": 0,\n", + " \"delta\": 31,\n", + " \"deltaBucket\": \"0.0\",\n", + " \"diffUrl\": \"https://sv.wikipedia.org/w/index.php?oldid=36099284&rcid=89369918\",\n", + " \"flags\": \"NB\",\n", + " \"isAnonymous\": \"false\",\n", + " \"isMinor\": \"false\",\n", + " \"isNew\": \"true\",\n", + " \"isRobot\": \"true\",\n", + " \"isUnpatrolled\": \"false\",\n", + " \"metroCode\": \"\",\n", + " \"namespace\": \"Main\",\n", + " \"page\": \"Salo Toraut\",\n", + " \"regionIsoCode\": \"\",\n", + " \"regionName\": \"\",\n", + " \"timestamp\": \"2016-06-27T00:00:11.080Z\",\n", + " \"user\": \"Lsjbot\"\n", + " },\n", + " {\n", + " \"__time\": \"2016-06-27T00:00:17.457Z\",\n", + " \"added\": 125,\n", + " \"channel\": \"#ja.wikipedia\",\n", + " \"cityName\": \"\",\n", + " \"comment\": \"70\\u5e74\\u4ee3\",\n", + " \"commentLength\": 4,\n", + " \"countryIsoCode\": \"\",\n", + " \"countryName\": \"\",\n", + " \"deleted\": 0,\n", + " 
\"delta\": 125,\n", + " \"deltaBucket\": \"100.0\",\n", + " \"diffUrl\": \"https://ja.wikipedia.org/w/index.php?diff=60239890&oldid=60239620\",\n", + " \"flags\": \"\",\n", + " \"isAnonymous\": \"false\",\n", + " \"isMinor\": \"false\",\n", + " \"isNew\": \"false\",\n", + " \"isRobot\": \"false\",\n", + " \"isUnpatrolled\": \"false\",\n", + " \"metroCode\": \"\",\n", + " \"namespace\": \"\\u5229\\u7528\\u8005\",\n", + " \"page\": \"\\u5229\\u7528\\u8005:\\u30ef\\u30fc\\u30ca\\u30fc\\u6210\\u5897/\\u653e\\u9001\\u30a6\\u30fc\\u30de\\u30f3\\u8cde\",\n", + " \"regionIsoCode\": \"\",\n", + " \"regionName\": \"\",\n", + " \"timestamp\": \"2016-06-27T00:00:17.457Z\",\n", + " \"user\": \"\\u30ef\\u30fc\\u30ca\\u30fc\\u6210\\u5897\"\n", + " },\n", + " {\n", + " \"__time\": \"2016-06-27T00:00:34.959Z\",\n", + " \"added\": 2,\n", + " \"channel\": \"#en.wikipedia\",\n", + " \"cityName\": \"Buenos Aires\",\n", + " \"comment\": \"/* Scores */\",\n", + " \"commentLength\": 12,\n", + " \"countryIsoCode\": \"AR\",\n", + " \"countryName\": \"Argentina\",\n", + " \"deleted\": 0,\n", + " \"delta\": 2,\n", + " \"deltaBucket\": \"0.0\",\n", + " \"diffUrl\": \"https://en.wikipedia.org/w/index.php?diff=727144213&oldid=727144184\",\n", + " \"flags\": \"\",\n", + " \"isAnonymous\": \"true\",\n", + " \"isMinor\": \"false\",\n", + " \"isNew\": \"false\",\n", + " \"isRobot\": \"false\",\n", + " \"isUnpatrolled\": \"false\",\n", + " \"metroCode\": \"\",\n", + " \"namespace\": \"Main\",\n", + " \"page\": \"Bailando 2015\",\n", + " \"regionIsoCode\": \"C\",\n", + " \"regionName\": \"Buenos Aires F.D.\",\n", + " \"timestamp\": \"2016-06-27T00:00:34.959Z\",\n", + " \"user\": \"181.230.118.178\"\n", + " }\n", + "]\n" + ] + } + ], "source": [ "endpoint = \"/druid/v2/sql\"\n", "print(druid_host+endpoint)\n", @@ -342,7 +636,7 @@ "\n", "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", "\n", - "print(json.dumps(json.loads(response.text), indent=4))\n", + 
"print(json.dumps(response.json(), indent=4))\n", "\n" ] }, @@ -389,7 +683,7 @@ "headers = {'Content-Type': 'application/json'}\n", "\n", "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", - "print(json.dumps(json.loads(response.text), indent=4))" + "print(json.dumps(response.json(), indent=4))" ] }, { @@ -406,7 +700,7 @@ "- [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html)\n", "- [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html)\n", "\n", - "You can also try out the [druid-client](https://github.com/paul-rogers/druid-client), a Python library for Druid created by a Druid contributor.\n", + "You can also try out the [druid-client](https://github.com/paul-rogers/druid-client), a Python library for Druid created by Paul Rogers, a Druid contributor.\n", "\n", "\n", "\n" @@ -429,7 +723,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.8" + "version": "3.9.15" }, "vscode": { "interpreter": { From 915c5a953272b81721249011d7468df8117e57b1 Mon Sep 17 00:00:00 2001 From: "brian.le" Date: Tue, 22 Nov 2022 16:30:41 -0800 Subject: [PATCH 4/7] typo --- examples/quickstart/jupyter-notebooks/api-tutorial.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb index d6c64c93d81b..7c9efd1f1ff6 100644 --- a/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb +++ b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb @@ -446,7 +446,7 @@ "\n", "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", "ingestion_status = json.loads(response.text)['status']['status']\n", - "# If you only want to fetch the status only once and print it, \n", + "# If you only want to fetch the status once and print it, \n", "# uncomment the print statement and comment 
out the if and while loops\n", "# print(json.dumps(response.json(), indent=4))\n", "\n", From 5594ec693b656c5744d293f5846d9824277b072a Mon Sep 17 00:00:00 2001 From: "brian.le" Date: Fri, 2 Dec 2022 13:43:31 -0800 Subject: [PATCH 5/7] add commentary to outputs --- .../jupyter-notebooks/api-tutorial.ipynb | 344 +++--------------- 1 file changed, 44 insertions(+), 300 deletions(-) diff --git a/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb index 7c9efd1f1ff6..e5031aa3aa63 100644 --- a/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb +++ b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb @@ -34,7 +34,7 @@ "- Ingesting data\n", "- Querying data\n", "\n", - "Different [Druid server types](https://druid.apache.org/docs/latest/design/processes.html#server-types) are responsible for handling different APIs for the Druid services. For example, make API calls to the Overlord service on the Master server to get the status of a task. You'll also interact the Broker service on the Query Server to see what datasources are available. And to run queries, you'll interact with the Router on the Query server.\n", + "Different [Druid server types](https://druid.apache.org/docs/latest/design/processes.html#server-types) are responsible for handling different APIs for the Druid services. For example, the Overlord service on the Master server provides the status of a task. You'll also interact with the Broker service on the Query server to see what datasources are available. And to run queries, you'll interact with the Broker. 
The Router service on the Query servers routes API calls.\n", "\n", "For more information, see the [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html), which is organized by server type.\n", "\n", @@ -44,10 +44,9 @@ "- [Get basic cluster information](#Get-basic-cluster-information)\n", "- [Ingest data](#Ingest-data)\n", "- [Query your data](#Query-your-data)\n", - "- [Manage your data](#Manage-your-data)\n", "- [Next steps](#Next-steps)\n", "\n", - "For the best experience, use Jupyter Lab so that you can always access the table of contents." + "For the best experience, use JupyterLab so that you can always access the table of contents." ] }, { @@ -71,41 +70,34 @@ "./bin/start-micro-quickstart\n", "```\n", "\n", - "Finally, you'll need either Jupyter lab (recommended) or Jupyter notebook. Both the quickstart Druid cluster and Jupyter notebook are deployed at `localhost:8888` by default, so you'll \n", + "Finally, you'll need either JupyterLab (recommended) or Jupyter Notebook. Both the quickstart Druid cluster and Jupyter are deployed at `localhost:8888` by default, so you'll \n", "need to change the port for Jupyter. To do so, stop Jupyter and start it again with the `port` parameter included. For example, you can use the following command to start Jupyter on port `3001`:\n", "\n", "```bash\n", - "# If you're using Jupyter lab\n", + "# If you're using JupyterLab\n", "jupyter lab --port 3001\n", - "# If you're using Jupyter notebook\n", + "# If you're using Jupyter Notebook\n", "jupyter notebook --port 3001 \n", "```\n", "\n", - "To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the the Druid host the tutorial uses. The quickstart deployment configures Druid's Router service to listen on port `8888` by default, so you'll be making API calls against `http://localhost:8888`. 
This is the port for the Router, which direct your API call to the appropriate service for most things." + "To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the Druid host, where the Router service listens. " ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "id": "b7f08a52", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "http://localhost:8888\n" - ] - } - ], + "outputs": [], "source": [ "import requests\n", "import json\n", "\n", "# druid_host is the hostname and port for your Druid deployment. \n", + "# In a distributed environment, use the Router service as the `druid_host`. \n", "druid_host = \"http://localhost:8888\"\n", "dataSourceName = \"wikipedia_api\"\n", - "print(druid_host)" + "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")" ] }, { @@ -132,127 +124,22 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "id": "baa140b8", "metadata": { "tags": [] }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "http://localhost:8888/status\n", - "{\n", - " \"version\": \"24.0.0\",\n", - " \"modules\": [\n", - " {\n", - " \"name\": \"org.apache.druid.common.gcp.GcpModule\",\n", - " \"artifact\": \"druid-gcp-common\",\n", - " \"version\": \"24.0.0\"\n", - " },\n", - " {\n", - " \"name\": \"org.apache.druid.common.aws.AWSModule\",\n", - " \"artifact\": \"druid-aws-common\",\n", - " \"version\": \"24.0.0\"\n", - " },\n", - " {\n", - " \"name\": \"org.apache.druid.storage.hdfs.HdfsStorageDruidModule\",\n", - " \"artifact\": \"druid-hdfs-storage\",\n", - " \"version\": \"24.0.0\"\n", - " },\n", - " {\n", - " \"name\": \"org.apache.druid.indexing.kafka.KafkaIndexTaskModule\",\n", - " \"artifact\": \"druid-kafka-indexing-service\",\n", - " \"version\": \"24.0.0\"\n", - " },\n", - " {\n", - " \"name\": 
\"org.apache.druid.query.aggregation.datasketches.theta.SketchModule\",\n", - " \"artifact\": \"druid-datasketches\",\n", - " \"version\": \"24.0.0\"\n", - " },\n", - " {\n", - " \"name\": \"org.apache.druid.query.aggregation.datasketches.theta.oldapi.OldApiSketchModule\",\n", - " \"artifact\": \"druid-datasketches\",\n", - " \"version\": \"24.0.0\"\n", - " },\n", - " {\n", - " \"name\": \"org.apache.druid.query.aggregation.datasketches.quantiles.DoublesSketchModule\",\n", - " \"artifact\": \"druid-datasketches\",\n", - " \"version\": \"24.0.0\"\n", - " },\n", - " {\n", - " \"name\": \"org.apache.druid.query.aggregation.datasketches.tuple.ArrayOfDoublesSketchModule\",\n", - " \"artifact\": \"druid-datasketches\",\n", - " \"version\": \"24.0.0\"\n", - " },\n", - " {\n", - " \"name\": \"org.apache.druid.query.aggregation.datasketches.hll.HllSketchModule\",\n", - " \"artifact\": \"druid-datasketches\",\n", - " \"version\": \"24.0.0\"\n", - " },\n", - " {\n", - " \"name\": \"org.apache.druid.msq.guice.MSQExternalDataSourceModule\",\n", - " \"artifact\": \"druid-multi-stage-query\",\n", - " \"version\": \"24.0.0\"\n", - " },\n", - " {\n", - " \"name\": \"org.apache.druid.msq.guice.MSQIndexingModule\",\n", - " \"artifact\": \"druid-multi-stage-query\",\n", - " \"version\": \"24.0.0\"\n", - " },\n", - " {\n", - " \"name\": \"org.apache.druid.msq.guice.MSQDurableStorageModule\",\n", - " \"artifact\": \"druid-multi-stage-query\",\n", - " \"version\": \"24.0.0\"\n", - " },\n", - " {\n", - " \"name\": \"org.apache.druid.msq.guice.MSQServiceClientModule\",\n", - " \"artifact\": \"druid-multi-stage-query\",\n", - " \"version\": \"24.0.0\"\n", - " },\n", - " {\n", - " \"name\": \"org.apache.druid.msq.guice.MSQSqlModule\",\n", - " \"artifact\": \"druid-multi-stage-query\",\n", - " \"version\": \"24.0.0\"\n", - " },\n", - " {\n", - " \"name\": \"org.apache.druid.msq.guice.SqlTaskModule\",\n", - " \"artifact\": \"druid-multi-stage-query\",\n", - " \"version\": \"24.0.0\"\n", - " 
},\n", - " {\n", - " \"name\": \"org.apache.druid.data.input.protobuf.ProtobufExtensionsModule\",\n", - " \"artifact\": \"druid-protobuf-extensions\",\n", - " \"version\": \"24.0.0\"\n", - " },\n", - " {\n", - " \"name\": \"org.apache.druid.data.input.avro.AvroExtensionsModule\",\n", - " \"artifact\": \"druid-avro-extensions\",\n", - " \"version\": \"24.0.0\"\n", - " }\n", - " ],\n", - " \"memory\": {\n", - " \"maxMemory\": 134217728,\n", - " \"totalMemory\": 134217728,\n", - " \"freeMemory\": 88157184,\n", - " \"usedMemory\": 46060544,\n", - " \"directMemory\": 134217728\n", - " }\n", - "}\n" - ] - } - ], + "outputs": [], "source": [ "endpoint = \"/status\"\n", - "print(druid_host+endpoint)\n", + "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n", "http_method = \"GET\"\n", "\n", "payload = {}\n", "headers = {}\n", "\n", "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", - "print(json.dumps(response.json(), indent=4))" + "print(\"\\033[1mResponse\\033[0m: \\n\" + json.dumps(response.json(), indent=4))" ] }, { @@ -276,7 +163,7 @@ "source": [ "# GET \n", "endpoint = \"/status/health\"\n", - "print(druid_host+endpoint)\n", + "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n", "http_method = \"GET\"\n", "\n", "payload = {}\n", @@ -284,7 +171,7 @@ "\n", "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", "\n", - "print(response.text)" + "print(\"\\033[1mResponse\\033[0m: \" + response.text)" ] }, { @@ -315,23 +202,13 @@ }, { "cell_type": "code", - "execution_count": 79, + "execution_count": null, "id": "362b6a87", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "http://localhost:8888/druid/v2/sql/task\n", - "{\"taskId\":\"query-e6ee8e33-9d9a-4b8d-b54e-54978be36b2c\",\"state\":\"RUNNING\"}\n", - "Inserting data into the table named wikipedia_api.\n" - ] - } - ], + "outputs": [], "source": [ 
"endpoint = \"/druid/v2/sql/task\"\n", - "print(druid_host+endpoint)\n", + "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n", "http_method = \"POST\"\n", "\n", "\n", @@ -348,8 +225,10 @@ "headers = {'Content-Type': 'application/json'}\n", "\n", "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", - "ingestiion_taskId_response = response\n", - "print(response.text + f\"\\nInserting data into the table named {dataSourceName}.\")" + "ingestion_taskId_response = response\n", + "print(f\"\\033[1mQuery\\033[0m:\\n\" + payload)\n", + "print(f\"\\nInserting data into the table named {dataSourceName}\")\n", + "print(\"\\nThe response includes the task ID and the status: \" + response.text + \".\")" ] }, { @@ -364,21 +243,13 @@ }, { "cell_type": "code", - "execution_count": 71, + "execution_count": null, "id": "f578b9b2", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "query-3e7b7c8f-0014-425d-a7a6-495f6a876819\n" - ] - } - ], + "outputs": [], "source": [ - "ingestion_taskId = json.loads(ingestiion_taskId_response.text)['taskId']\n", - "print(ingestion_taskId)" + "ingestion_taskId = json.loads(ingestion_taskId_response.text)['taskId']\n", + "print(f\"This is the task ID: {ingestion_taskId}\")" ] }, { @@ -399,45 +270,15 @@ }, { "cell_type": "code", - "execution_count": 76, + "execution_count": null, "id": "fdbab6ae", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "http://localhost:8888/druid/indexer/v1/task/query-3e7b7c8f-0014-425d-a7a6-495f6a876819/status\n", - "The ingestion is complete:\n", - "{\n", - " \"task\": \"query-3e7b7c8f-0014-425d-a7a6-495f6a876819\",\n", - " \"status\": {\n", - " \"id\": \"query-3e7b7c8f-0014-425d-a7a6-495f6a876819\",\n", - " \"groupId\": \"query-3e7b7c8f-0014-425d-a7a6-495f6a876819\",\n", - " \"type\": \"query_controller\",\n", - " \"createdTime\": \"2022-11-23T00:10:00.529Z\",\n", - " 
\"queueInsertionTime\": \"1970-01-01T00:00:00.000Z\",\n", - " \"statusCode\": \"SUCCESS\",\n", - " \"status\": \"SUCCESS\",\n", - " \"runnerStatusCode\": \"WAITING\",\n", - " \"duration\": 97332,\n", - " \"location\": {\n", - " \"host\": \"localhost\",\n", - " \"port\": 8100,\n", - " \"tlsPort\": -1\n", - " },\n", - " \"dataSource\": \"wikipedia_api3\",\n", - " \"errorMsg\": null\n", - " }\n", - "}\n" - ] - } - ], + "outputs": [], "source": [ "import time\n", "\n", "endpoint = f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n", - "print(druid_host+endpoint)\n", + "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n", "http_method = \"GET\"\n", "\n", "\n", @@ -490,29 +331,20 @@ }, { "cell_type": "code", - "execution_count": 41, + "execution_count": null, "id": "959e3c9b", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "http://localhost:8888/druid/coordinator/v1/datasources\n", - "[\"transactions\",\"wikipedia\",\"wikipedia-kafka\",\"wikipedia_api\"]\n" - ] - } - ], + "outputs": [], "source": [ "endpoint = \"/druid/coordinator/v1/datasources\"\n", - "print(druid_host+endpoint)\n", + "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n", "http_method = \"GET\"\n", "\n", "payload = {}\n", "headers = {}\n", "\n", "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", - "print(response.text)" + "print(\"\\nThe response is the list of datasources available in your Druid deployment: \" + response.text)" ] }, { @@ -529,104 +361,13 @@ }, { "cell_type": "code", - "execution_count": 42, + "execution_count": null, "id": "694900d0-891f-41bd-9b45-5ae957385244", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "http://localhost:8888/druid/v2/sql\n", - "[\n", - " {\n", - " \"__time\": \"2016-06-27T00:00:11.080Z\",\n", - " \"added\": 31,\n", - " \"channel\": \"#sv.wikipedia\",\n", - " \"cityName\": \"\",\n", 
- " \"comment\": \"Botskapande Indonesien omdirigering\",\n", - " \"commentLength\": 35,\n", - " \"countryIsoCode\": \"\",\n", - " \"countryName\": \"\",\n", - " \"deleted\": 0,\n", - " \"delta\": 31,\n", - " \"deltaBucket\": \"0.0\",\n", - " \"diffUrl\": \"https://sv.wikipedia.org/w/index.php?oldid=36099284&rcid=89369918\",\n", - " \"flags\": \"NB\",\n", - " \"isAnonymous\": \"false\",\n", - " \"isMinor\": \"false\",\n", - " \"isNew\": \"true\",\n", - " \"isRobot\": \"true\",\n", - " \"isUnpatrolled\": \"false\",\n", - " \"metroCode\": \"\",\n", - " \"namespace\": \"Main\",\n", - " \"page\": \"Salo Toraut\",\n", - " \"regionIsoCode\": \"\",\n", - " \"regionName\": \"\",\n", - " \"timestamp\": \"2016-06-27T00:00:11.080Z\",\n", - " \"user\": \"Lsjbot\"\n", - " },\n", - " {\n", - " \"__time\": \"2016-06-27T00:00:17.457Z\",\n", - " \"added\": 125,\n", - " \"channel\": \"#ja.wikipedia\",\n", - " \"cityName\": \"\",\n", - " \"comment\": \"70\\u5e74\\u4ee3\",\n", - " \"commentLength\": 4,\n", - " \"countryIsoCode\": \"\",\n", - " \"countryName\": \"\",\n", - " \"deleted\": 0,\n", - " \"delta\": 125,\n", - " \"deltaBucket\": \"100.0\",\n", - " \"diffUrl\": \"https://ja.wikipedia.org/w/index.php?diff=60239890&oldid=60239620\",\n", - " \"flags\": \"\",\n", - " \"isAnonymous\": \"false\",\n", - " \"isMinor\": \"false\",\n", - " \"isNew\": \"false\",\n", - " \"isRobot\": \"false\",\n", - " \"isUnpatrolled\": \"false\",\n", - " \"metroCode\": \"\",\n", - " \"namespace\": \"\\u5229\\u7528\\u8005\",\n", - " \"page\": \"\\u5229\\u7528\\u8005:\\u30ef\\u30fc\\u30ca\\u30fc\\u6210\\u5897/\\u653e\\u9001\\u30a6\\u30fc\\u30de\\u30f3\\u8cde\",\n", - " \"regionIsoCode\": \"\",\n", - " \"regionName\": \"\",\n", - " \"timestamp\": \"2016-06-27T00:00:17.457Z\",\n", - " \"user\": \"\\u30ef\\u30fc\\u30ca\\u30fc\\u6210\\u5897\"\n", - " },\n", - " {\n", - " \"__time\": \"2016-06-27T00:00:34.959Z\",\n", - " \"added\": 2,\n", - " \"channel\": \"#en.wikipedia\",\n", - " \"cityName\": \"Buenos 
Aires\",\n", - " \"comment\": \"/* Scores */\",\n", - " \"commentLength\": 12,\n", - " \"countryIsoCode\": \"AR\",\n", - " \"countryName\": \"Argentina\",\n", - " \"deleted\": 0,\n", - " \"delta\": 2,\n", - " \"deltaBucket\": \"0.0\",\n", - " \"diffUrl\": \"https://en.wikipedia.org/w/index.php?diff=727144213&oldid=727144184\",\n", - " \"flags\": \"\",\n", - " \"isAnonymous\": \"true\",\n", - " \"isMinor\": \"false\",\n", - " \"isNew\": \"false\",\n", - " \"isRobot\": \"false\",\n", - " \"isUnpatrolled\": \"false\",\n", - " \"metroCode\": \"\",\n", - " \"namespace\": \"Main\",\n", - " \"page\": \"Bailando 2015\",\n", - " \"regionIsoCode\": \"C\",\n", - " \"regionName\": \"Buenos Aires F.D.\",\n", - " \"timestamp\": \"2016-06-27T00:00:34.959Z\",\n", - " \"user\": \"181.230.118.178\"\n", - " }\n", - "]\n" - ] - } - ], + "outputs": [], "source": [ "endpoint = \"/druid/v2/sql\"\n", - "print(druid_host+endpoint)\n", + "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n", "http_method = \"POST\"\n", "\n", "payload = json.dumps({\n", @@ -636,7 +377,9 @@ "\n", "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", "\n", - "print(json.dumps(response.json(), indent=4))\n", + "print(\"\\033[1mQuery\\033[0m:\\n\" + payload)\n", + "print(f\"\\nEach JSON object in the response represents a row in the {dataSourceName} datasource.\") \n", + "print(\"\\n\\033[1mResponse\\033[0m: \\n\" + json.dumps(response.json(), indent=4))\n", "\n" ] }, @@ -668,7 +411,7 @@ "outputs": [], "source": [ "endpoint = \"/druid/v2/sql\"\n", - "print(druid_host+endpoint)\n", + "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n", "http_method = \"POST\"\n", "\n", "payload = json.dumps({\n", @@ -683,7 +426,8 @@ "headers = {'Content-Type': 'application/json'}\n", "\n", "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", - "print(json.dumps(response.json(), indent=4))" + 
"print(\"\\033[1mQuery\\033[0m:\\n\" + payload)\n", + "print(\"\\n\\033[1mResponse\\033[0m: \\n\" + json.dumps(response.json(), indent=4))\n" ] }, { @@ -723,7 +467,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.15" + "version": "3.10.6" }, "vscode": { "interpreter": { From 95930afeeffa2dfed06868dec69d9b82058be3a8 Mon Sep 17 00:00:00 2001 From: "brian.le" Date: Wed, 7 Dec 2022 10:46:18 -0800 Subject: [PATCH 6/7] address feedback from will --- .../jupyter-notebooks/api-tutorial.ipynb | 27 ++++++++++++++----- 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb index e5031aa3aa63..ae4e5d1648e2 100644 --- a/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb +++ b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb @@ -44,7 +44,7 @@ "- [Get basic cluster information](#Get-basic-cluster-information)\n", "- [Ingest data](#Ingest-data)\n", "- [Query your data](#Query-your-data)\n", - "- [Next steps](#Next-steps)\n", + "- [Learn more](#Learn-more)\n", "\n", "For the best experience, use JupyterLab so that you can always access the table of contents." ] @@ -94,7 +94,7 @@ "import json\n", "\n", "# druid_host is the hostname and port for your Druid deployment. \n", - "# In a distributed environment, use the Router service as the `druid_host`. \n", + "# In a distributed environment, you can point to other Druid services. In this tutorial, you'll use the Router service as the `druid_host`. \n", "druid_host = \"http://localhost:8888\"\n", "dataSourceName = \"wikipedia_api\"\n", "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")" @@ -195,9 +195,23 @@ "- Includes a payload that inserts data from an external source into a table named wikipedia_api. The payload is in JSON format and included in the code directly. You can also store it in a file and provide the file. 
\n", "- Saves the response to a unique variable that you can reference later to identify this ingestion task\n", "\n", - "The example uses INSERT, but you could also use REPLACE. \n", + "The example uses INSERT, but you could also use REPLACE INTO. In fact, if you have an existing datasource with the name `wikipedia_api`, you need to use REPLACE INTO instead. \n", "\n", - "The MSQ task engine uses a task to ingest data. The response for the API includes a `taskId` and `state` for your ingestion. You can use this `taskId` to reference this task later on to get more information about it." + "The MSQ task engine uses a task to ingest data. The response for the API includes a `taskId` and `state` for your ingestion. You can use this `taskId` to reference this task later on to get more information about it.\n", + "\n", + "Before you ingest the data, take a look at the query. Pay attention to two parts of it, `__time` and `PARTITIONED BY`, which relate to how Druid partitions data:\n", + "\n", + "- **`__time`**\n", + "\n", + " The `__time` column is a key concept for Druid. It's the default partition for Druid and is treated as the primary timestamp. Use it to help you write faster and more efficient queries. Big datasets, such as those for event data, typically have a time component. This means that instead of writing a query using only `COUNT`, you can combine that with `WHERE __time` to return results much more quickly.\n", + "\n", + "- **`PARTITIONED BY DAY`**\n", + "\n", + " If you partition by day, Druid creates segment files within the partition based on the day. You can only replace, delete and update data at the partition level. 
So when you're deciding how to partition data, make the partition large enough (min 500,000 rows) for good performance but not so big that those operations become impractical to run.\n",
+ "\n",
+ "To learn more, see [Partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html).\n",
+ "\n",
+ "Now, run the next cell to start the ingestion."
 ]
 },
 {
@@ -212,6 +226,7 @@
 "http_method = \"POST\"\n",
 "\n",
 "\n",
+ "# The query uses INSERT INTO. If you have an existing datasource with the name wikipedia_api, use REPLACE INTO instead.\n",
 "payload = json.dumps({\n",
 "\"query\": \"INSERT INTO wikipedia_api SELECT TIME_PARSE(\\\"timestamp\\\") \\\n",
 " AS __time, * FROM TABLE \\\n",
@@ -354,7 +369,7 @@
 "tags": []
 },
 "source": [
- "### Query your data\n",
+ "### SELECT data\n",
 "\n",
 "Now, you can query the data. Because this tutorial is running in Jupyter, make sure to limit the size of your query results using `LIMIT`. For example, the following cell selects all columns but limits the results to 3 rows for display purposes because each row is a JSON object. In actual use cases, you'll want to only select the rows that you need. For more information about the kinds of things you can do, see [Druid SQL](https://druid.apache.org/docs/latest/querying/sql.html).\n"
 ]
@@ -437,7 +452,7 @@
 "tags": []
 },
 "source": [
- "## Next steps\n",
+ "## Learn more\n",
 "\n",
 "This tutorial covers the some of the basics related to the Druid API. 
To learn more about the kinds of things you can do, see the API documentation:\n",
 "\n",
From ab5dcd7a8b9b9b4969095bed082390e7170fae8c Mon Sep 17 00:00:00 2001
From: "brian.le"
Date: Fri, 9 Dec 2022 10:00:01 -0800
Subject: [PATCH 7/7] delete unnecessary comment

---
 examples/quickstart/jupyter-notebooks/api-tutorial.ipynb | 1 -
 1 file changed, 1 deletion(-)

diff --git a/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb
index ae4e5d1648e2..b795babaefef 100644
--- a/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb
+++ b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb
@@ -161,7 +161,6 @@
 "metadata": {},
 "outputs": [],
 "source": [
- "# GET \n",
 "endpoint = \"/status/health\"\n",
 "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
 "http_method = \"GET\"\n",