From e21731ac30b984145a9c20dcdb0edbc3982e236b Mon Sep 17 00:00:00 2001
From: Katya Macedo
Date: Fri, 16 Dec 2022 16:22:07 -0600
Subject: [PATCH] Add Partitioning tutorial

---
 .../partitioned-by-tutorial.ipynb | 428 ++++++++++++++++++
 1 file changed, 428 insertions(+)
 create mode 100644 examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb

diff --git a/examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb
new file mode 100644
index 000000000000..02028c3e0517
--- /dev/null
+++ b/examples/quickstart/jupyter-notebooks/partitioned-by-tutorial.ipynb
@@ -0,0 +1,428 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ad4e60b6",
+   "metadata": {
+    "deletable": true,
+    "editable": true,
+    "tags": []
+   },
+   "source": [
+    "# Tutorial: Druid SQL segment sizing and partitioning\n",
+    "\n",
+    "Partitioning is a method of organizing a large datasource into independent partitions.\n",
+    "Partitioning can reduce your storage footprint and improve query performance.\n",
+    "\n",
+    "At ingestion time, Apache Druid always partitions data by time.\n",
+    "Each time chunk is then divided into one or more [segments](https://druid.apache.org/docs/latest/design/segments.html).\n",
+    "\n",
+    "This tutorial describes how to configure partitioning for the Druid SQL ingestion method. For information about partitioning configurations supported by other ingestion methods, see [How to configure partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html#how-to-configure-partitioning)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8d6bbbcb",
+   "metadata": {
+    "deletable": true,
+    "tags": []
+   },
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "Make sure that you meet the requirements outlined in the README.md file of the [apache/druid repo](https://github.com/apache/druid/tree/master/examples/quickstart/jupyter-notebooks/).\n",
+    "Specifically, you need the following:\n",
+    "- Knowledge of SQL\n",
+    "- [Python 3](https://www.python.org/downloads/)\n",
+    "- [The `requests` package for Python](https://requests.readthedocs.io/en/latest/user/install/)\n",
+    "- [JupyterLab](https://jupyter.org/install#jupyterlab) (recommended) or [Jupyter Notebook](https://jupyter.org/install#jupyter-notebook) running on a non-default port. Druid and Jupyter both default to port `8888`, so you need to start Jupyter on a different port.\n",
+    "- An available Druid instance. This tutorial uses the `micro-quickstart` configuration described in the [Druid quickstart](https://druid.apache.org/docs/latest/tutorials/index.html), so no authentication or authorization is required unless explicitly mentioned. If you haven’t already, download Druid version 24.0 or higher and start Druid services as described in the quickstart."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8f8e64f0-c29a-473c-8783-a2ff8648acd7",
+   "metadata": {},
+   "source": [
+    "## Prepare your environment\n",
+    "\n",
+    "Start by running the following cell. It imports the required Python packages and defines a variable for the Druid host."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b7f08a52",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "import json\n",
+    "\n",
+    "# druid_host is the hostname and port for your Druid deployment.\n",
+    "# In a distributed environment, use the Router service as the `druid_host`.\n",
+    "druid_host = \"http://localhost:8888\"\n",
+    "# dataSourceName is the name of the table that this tutorial creates.\n",
+    "dataSourceName = \"partitioning-tutorial\"\n",
+    "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")"
+   ]
+  },
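+  {
+   "cell_type": "markdown",
+   "id": "0f3c1b2a",
+   "metadata": {},
+   "source": [
+    "As a quick sanity check, you can confirm that Druid is reachable before you continue. The following cell sends a `GET` request to Druid's `/status` endpoint, which returns basic information about the deployment, including the Druid version. This check is optional and is not required for the rest of the tutorial."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1a2b3c4d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Confirm that Druid is up by requesting its status.\n",
+    "# If this raises an exception, check that Druid is running\n",
+    "# and that druid_host points at your Router service.\n",
+    "endpoint = \"/status\"\n",
+    "response = requests.get(druid_host + endpoint)\n",
+    "response.raise_for_status()\n",
+    "print(f\"Druid version: {response.json()['version']}\")"
+   ]
+  },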
\n", + "druid_host = \"http://localhost:8888\"\n", + "dataSourceName = \"partitioning-tutorial\"\n", + "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")" + ] + }, + { + "cell_type": "markdown", + "id": "e893ef7d-7136-442f-8bd9-31b5a5276518", + "metadata": {}, + "source": [ + "In the rest of the tutorial, the `endpoint`, `http_method`, and `payload` variables are updated to accomplish different tasks." + ] + }, + { + "cell_type": "markdown", + "id": "ebd8c7db-c39f-4ef7-86ec-81f405e02550", + "metadata": {}, + "source": [ + "## Segment size\n", + "\n", + "A segment is the smallest unit of storage in Druid.\n", + "It is recommended that you optimize your segment file size at ingestion time for Druid to operate well under a heavy query load.\n", + "\n", + "Consider the following to optimize your segment file size:\n", + "\n", + "- The number of rows per segment should be around five million. You can set the number of rows per segment using the `rowsPerSegment` query context parameter in the [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html) or as a [JDBC connection properties object](https://druid.apache.org/docs/latest/querying/sql-jdbc.html). To specify the `rowsPerSegment` parameters in the Druid web console, navigate to the **Query** page, then click **Engine > Edit context** to bring up the **Edit query context** dialog. For more information on how to specify query context parameters, see [Setting the query context](https://druid.apache.org/docs/latest/querying/sql-query-context.html#setting-the-query-context).\n", + "- Segment file size should be within the range of 300-700 MB. The number of rows per segment takes precedence over the segment byte size. \n", + "\n", + "For more information on segment sizing, see [Segment size optimization](https://druid.apache.org/docs/latest/operations/segment-optimization.html)." + ] + }, + { + "cell_type": "markdown", + "id": "84cb68a0-beb1-47d5-9fd5-384ea0caa35d", + "metadata": {}, + "source": [ + "## PARTITIONED BY\n", + "\n", + "In Druid SQL, the granularity of a segment is defined by the granularity of the PARTITIONED BY clause.\n", + "\n", + "[INSERT](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#insert) and [REPLACE](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#replace) statements both require the PARTITIONED BY clause.\n", + "\n", + "PARTITIONED BY accepts the following time granularity arguments:\n", + "- `time_unit`\n", + "- `TIME_FLOOR(__time, period)` \n", + "- `FLOOR(__time TO time_unit)`\n", + "- `ALL` or `ALL TIME`\n", + "\n", + "Continue reading to learn about each of the supported arguments.\n", + "\n", + "### Time unit\n", + "\n", + "`PARTITIONED BY(time_unit)`. 
+  {
+   "cell_type": "markdown",
+   "id": "84cb68a0-beb1-47d5-9fd5-384ea0caa35d",
+   "metadata": {},
+   "source": [
+    "## PARTITIONED BY\n",
+    "\n",
+    "In Druid SQL, the PARTITIONED BY clause defines the time chunk granularity for the segments of a datasource.\n",
+    "\n",
+    "[INSERT](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#insert) and [REPLACE](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#replace) statements both require the PARTITIONED BY clause.\n",
+    "\n",
+    "PARTITIONED BY accepts the following time granularity arguments:\n",
+    "- `time_unit`\n",
+    "- `TIME_FLOOR(__time, period)`\n",
+    "- `FLOOR(__time TO time_unit)`\n",
+    "- `ALL` or `ALL TIME`\n",
+    "\n",
+    "Continue reading to learn about each of the supported arguments.\n",
+    "\n",
+    "### Time unit\n",
+    "\n",
+    "`PARTITIONED BY time_unit`. Partition by `SECOND`, `MINUTE`, `HOUR`, `DAY`, `WEEK`, `MONTH`, `QUARTER`, or `YEAR`.\n",
+    "\n",
+    "For example, run the following cell to ingest data from an external source into a table named `partitioning-tutorial` and partition the datasource by `DAY`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "045f782c-74d8-4447-9487-529071812b51",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = \"/druid/v2/sql/task\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"POST\"\n",
+    "\n",
+    "# If you already have an existing datasource named partitioning-tutorial, use REPLACE INTO instead of INSERT INTO.\n",
+    "payload = json.dumps({\n",
+    "\"query\": \"INSERT INTO \\\"partitioning-tutorial\\\" SELECT TIME_PARSE(\\\"timestamp\\\") \\\n",
+    " AS __time, * FROM TABLE \\\n",
+    " (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\": [\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\": \\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"user\\\", \\\"type\\\": \\\"string\\\"}]')) \\\n",
+    " PARTITIONED BY DAY\",\n",
+    " \"context\": {\n",
+    "  \"maxNumTasks\": 3\n",
+    " }\n",
+    "})\n",
+    "\n",
+    "headers = {'Content-Type': 'application/json'}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n",
+    "ingestion_taskId_response = response\n",
+    "ingestion_taskId = json.loads(ingestion_taskId_response.text)['taskId']\n",
+    "\n",
+    "print(f\"\\033[1mQuery\\033[0m:\\n\" + payload)\n",
+    "print(f\"\\nInserting data into the table named {dataSourceName}\")\n",
+    "print(\"\\nThe response includes the task ID and the status: \" + response.text + \".\")"
+   ]
+  },
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "df12d12c-a067-4759-bae0-0410c24b6205", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import time\n", + "\n", + "endpoint = f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n", + "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n", + "http_method = \"GET\"\n", + "\n", + "payload = {}\n", + "headers = {}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "ingestion_status = json.loads(response.text)['status']['status']\n", + "# If you only want to fetch the status once and print it, \n", + "# uncomment the print statement and comment out the if and while loops\n", + "# print(json.dumps(response.json(), indent=4))\n", + "\n", + "if ingestion_status == \"RUNNING\":\n", + " print(\"The ingestion is running...\")\n", + "\n", + "while ingestion_status != \"SUCCESS\":\n", + " response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + " ingestion_status = json.loads(response.text)['status']['status']\n", + " time.sleep(15) \n", + " \n", + "if ingestion_status == \"SUCCESS\": \n", + " print(\"The ingestion is complete:\")\n", + " print(json.dumps(response.json(), indent=4))" + ] + }, + { + "cell_type": "markdown", + "id": "240b0ad5-48f2-4737-b12b-5fd5f98da300", + "metadata": {}, + "source": [ + "### TIME_FLOOR\n", + "\n", + "`PARTITIONED BY(TIME_FLOOR(__time, period))`. Partition by a timestamp rounded to the specified period.\n", + "\n", + "`period` can be any of the following ISO 8601 periods:\n", + "- `PT1S`: one second\n", + "- `PT1M`: one minute\n", + "- `PT5M`: five minutes\n", + "- `PT10M`: ten minutes\n", + "- `PT15M`: fifteen minutes\n", + "- `PT30M`: thirty minutes\n", + "- `PT1H`: one hour\n", + "- `PT6H`: six hours\n", + "- `PT8H`: eight hours \n", + "- `P1D`: one day\n", + "- `P1W`: one week\n", + "- `P1M`: one month\n", + "- `P3M`: three months\n", + "- `P1Y`: one year\n", + "\n", + "Run the following cell to partition the `partitioning-tutorial` datasource by a timestamp rounded to thirty minutes:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "91dd255a-4d55-493e-a067-4cef5c659657", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "endpoint = \"/druid/v2/sql/task\"\n", + "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n", + "http_method = \"POST\"\n", + "\n", + "payload = json.dumps({\n", + "\"query\": \"REPLACE INTO \\\"partitioning-tutorial\\\" OVERWRITE ALL SELECT TIME_PARSE(\\\"timestamp\\\") \\\n", + " AS __time, * FROM TABLE \\\n", + " (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\": [\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\": \\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"flags\\\", 
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"user\\\", \\\"type\\\": \\\"string\\\"}]')) \\\n", + " PARTITIONED BY TIME_FLOOR(__time, 'PT30M')\",\n", + " \"context\": {\n", + " \"maxNumTasks\": 3\n", + " }\n", + "})\n", + "\n", + "headers = {'Content-Type': 'application/json'}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "ingestion_taskId_response = response\n", + "ingestion_taskId = json.loads(response.text)['taskId']\n", + "\n", + "print(f\"\\033[1mQuery\\033[0m:\\n\" + payload)\n", + "print(f\"\\nInserting data into the table named {dataSourceName}\")\n", + "print(\"\\nThe response includes the task ID and the status: \" + response.text + \".\")" + ] + }, + { + "cell_type": "markdown", + "id": "cbeb5a63", + "metadata": { + "deletable": true, + "tags": [] + }, + "source": [ + "### FLOOR\n", + "\n", + "`PARTITIONED BY(FLOOR(__time TO time_unit))`. Partition by the largest timestamp value that is less than or equal to the specified time unit, where `time_unit` can be any of the following values: `SECOND`, `MINUTE`, `HOUR`, `DAY`, `WEEK`, `MONTH`, `QUARTER`, `YEAR`.\n", + "\n", + "Run the following cell to partition the `partitioning-tutorial` datasource by a timestamp value less than or equal to `HOUR`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b9227d6c-1d8c-4169-b13b-a08625c4011f", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "endpoint = \"/druid/v2/sql/task\"\n", + "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n", + "http_method = \"POST\"\n", + "\n", + "payload = json.dumps({\n", + "\"query\": \"REPLACE INTO \\\"partitioning-tutorial\\\" OVERWRITE ALL SELECT TIME_PARSE(\\\"timestamp\\\") \\\n", + " AS __time, * FROM TABLE \\\n", + " (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\": [\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\": \\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isMinor\\\", \\\"type\\\": 
\\\"string\\\"}, {\\\"name\\\": \\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"user\\\", \\\"type\\\": \\\"string\\\"}]')) \\\n", + " PARTITIONED BY FLOOR(__time TO HOUR)\",\n", + " \"context\": {\n", + " \"maxNumTasks\": 3\n", + " }\n", + "})\n", + "\n", + "headers = {'Content-Type': 'application/json'}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "ingestion_taskId_response = response\n", + "ingestion_taskId = json.loads(response.text)['taskId']\n", + "\n", + "print(f\"\\033[1mQuery\\033[0m:\\n\" + payload)\n", + "print(f\"\\nInserting data into the table named {dataSourceName}\")\n", + "print(\"\\nThe response includes the task ID and the status: \" + response.text + \".\")" + ] + }, + { + "cell_type": "markdown", + "id": "c59ca797-dd91-442b-8d02-67b711b3fcc6", + "metadata": {}, + "source": [ + "### ALL and ALL TIME\n", + "\n", + "`PARTITIONED BY ALL`. Disable time partitioning by placing all data in a single time chunk.\n", + "\n", + "PARTITIONED BY ALL and PARTITIONED BY ALL TIME clauses are suitable for datasets that do not have a primary timestamp. In this case, Druid creates a `__time` column in your Druid datasource and sets all timestamps to `1970-01-01T00:00:00Z`.\n", + "\n", + "> To use LIMIT or OFFSET at the outer level of your INSERT or REPLACE query, you must set PARTITIONED BY to ALL or ALL TIME.\n", + "\n", + "Run the following cell to skip time partitioning and place all data into a single time chunk:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f7e3d62a-1325-4992-8bcd-c0f1925704bc", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = \"/druid/v2/sql/task\"\n", + "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n", + "http_method = \"POST\"\n", + "\n", + "payload = json.dumps({\n", + "\"query\": \"REPLACE INTO \\\"partitioning-tutorial\\\" OVERWRITE ALL SELECT TIME_PARSE(\\\"timestamp\\\") \\\n", + " AS __time, * FROM TABLE \\\n", + " (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\": [\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\": \\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"user\\\", \\\"type\\\": \\\"string\\\"}]')) \\\n", + " PARTITIONED BY ALL\",\n", + " \"context\": {\n", + " \"maxNumTasks\": 3\n", + " }\n", + "})\n", + "\n", + "headers = {'Content-Type': 'application/json'}\n", + "\n", + "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n", + "ingestion_taskId_response = response\n", + "ingestion_taskId = json.loads(response.text)['taskId']\n", + "\n", + "print(f\"\\033[1mQuery\\033[0m:\\n\" + payload)\n", + "print(f\"\\nInserting data into the table named {dataSourceName}\")\n", + "print(\"\\nThe response includes the task ID and the status: \" + response.text + \".\")" + ] + }, + { + "cell_type": "markdown", + "id": "8fbfa1fa-2cde-46d5-8107-60bd436fb64e", + "metadata": { + "deletable": true, + "editable": true, + "tags": [] + }, + "source": [ + "## Learn more\n", + "\n", + "To learn more about Druid segment sizing and partitioning, see the following topics:\n", + "\n", + "- [Segments](https://druid.apache.org/docs/latest/design/segments.html) for general information about segments in Druid. \n", + "- [Partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html) to learn how to set up partitions within a single datasource.\n", + "- [Context parameters](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#context-parameters) for context parameters specific to the multi-stage query task engine." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.0" + }, + "toc-autonumbering": false, + "toc-showcode": false, + "toc-showmarkdowntxt": false, + "toc-showtags": false, + "vscode": { + "interpreter": { + "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e" + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}