docs: notebook only for API tutorial #13345
Conversation
techdocsmith
left a comment
There was a problem hiding this comment.
This is a great introduction! Thank you. Minor suggestions here and there.
| "- Querying data\n", | ||
| "- Deleting data\n", | ||
| "\n", | ||
| "In a Druid deployment, you have [Mastery, Query, and Data servers](https://druid.apache.org/docs/latest/design/processes.html#server-types) that all fulfill different purposes. The endpoint you use for a certain action is determined, partially, by which server governs that part of Druid and the processes that run on that server type. That's why the [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html#historical) is organized by server type and process.\n", |
Could we pare this down a little more and make it specific to the APIs in this tutorial? Along the lines of:
Different Druid server types are responsible for handling different APIs for various services. This tutorial introduces x, y, and z. For more information, see [link to APIs], organized by server type.
Also typo: "Mastery"
| "- [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html)\n", | ||
| "- [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html)\n", | ||
| "\n", | ||
| "You can also try out the [druid-client](https://github.com/paul-rogers/druid-client), a Python library for Druid created by a Druid contributor.\n", |
I think we should mention the "Druid contributor", Paul Rogers, explicitly here.
Suggest shortening:
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2bethere
left a comment
Awesome tutorial! I cannot wait for this!
This is definitely going to be one of the highlights of the next release containing it.
Thanks again for putting this together.
| "import requests\n", | ||
| "import json\n", | ||
| "\n", | ||
| "# druid_host is the hostname and port for your Druid deployment. \n", |
| "# druid_host is the hostname and port for your Druid deployment. \n", | |
| "# druid_host is the hostname and port for your Druid deployment. \n By default, Druid runs on localhost port 8888. You can modify this to point to other Druid service locations.", |
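For what it's worth, the expanded comment could sit in a cell along these lines (the `/status` health check is my suggestion, not part of this diff):

```python
# druid_host is the hostname and port for your Druid deployment.
# By default, Druid runs on localhost port 8888. You can modify
# this to point to other Druid service locations.
druid_host = "http://localhost:8888"

# Build the endpoint URL once so later cells can reuse it.
status_endpoint = druid_host + "/status"
print(status_endpoint)

# Uncomment to verify that the deployment is reachable:
# import requests
# response = requests.get(status_endpoint)
# print(response.json())
```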
| "\n", | ||
| "\n", | ||
| "payload = json.dumps({\n", | ||
| "\"query\": \"INSERT INTO wikipedia_api SELECT TIME_PARSE(\\\"timestamp\\\") \\\n", |
One of the scenarios we need to account for is a failed ingest, or users rerunning this tutorial. Maybe add a comment highlighting that if the datasource already exists, you should use "REPLACE INTO" instead.
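To make that concrete, the cell could switch verbs based on whether the datasource exists. This is only a sketch: the `datasource_exists` flag and the shortened `EXTERN(...)` call are placeholders, not the tutorial's actual code.

```python
import json

# If a previous run already created the datasource (for example after a
# failed or repeated ingest), INSERT INTO will append duplicate rows.
# REPLACE ... OVERWRITE ALL rewrites the datasource instead.
datasource_exists = True  # placeholder; e.g. check GET /druid/coordinator/v1/datasources

verb = ("REPLACE INTO wikipedia_api OVERWRITE ALL"
        if datasource_exists
        else "INSERT INTO wikipedia_api")

query = (
    f"{verb} "
    'SELECT TIME_PARSE("timestamp") AS __time, * '
    "FROM TABLE(EXTERN(...)) "  # shortened; the full EXTERN call is in the notebook
    "PARTITIONED BY DAY"
)
payload = json.dumps({"query": query})
print(payload)
```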
| "tags": [] | ||
| }, | ||
| "source": [ | ||
| "### Query your data\n", |
This seems to be a repeat of the title "Query your data" above?
| "Druid supports dynamic parameters, so you can either define certain parameters within the query explicitly or insert a `?` as a placeholder and define it in a parameters block. In the following cell, the `?` gets bound to the timestamp value of `2016-06-27` at execution time. For more information, see [Dynamic parameters](https://druid.apache.org/docs/latest/querying/sql.html#dynamic-parameters).\n", | ||
| "\n", | ||
| "\n", | ||
| "The following cell selects rows where the `__time` column contains a value greater than the value defined dynamically in `parameters` and sets a custom `sqlQueryId`." |
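For readers skimming the review, the payload being described looks roughly like this (the table name, timestamp value, and query ID are illustrative):

```python
import json

# Sketch of a parameterized SQL payload: the "?" placeholder is bound to
# the TIMESTAMP value in the parameters block at execution time, and
# context.sqlQueryId sets a custom query ID.
payload = json.dumps({
    "query": "SELECT * FROM wikipedia_api WHERE __time > ? LIMIT 3",
    "parameters": [
        {"type": "TIMESTAMP", "value": "2016-06-27 00:00:00"}
    ],
    "context": {"sqlQueryId": "important-query"}
})
print(payload)

# POST this to {druid_host}/druid/v2/sql, e.g. with requests:
# requests.post(f"{druid_host}/druid/v2/sql", data=payload,
#               headers={"Content-Type": "application/json"})
```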
I think we need a section that talks about what `__time` is and why it matters to people.
This is a very distinctive Druid concept that most other databases don't have.
I think the important message here is:
- Big datasets almost always have time associated with them, for example event data.
- Druid partitions your data by `__time` by default. This speeds up queries that use a `WHERE __time ...` filter. Keep this in mind when you are designing applications.
- Traditional queries like `count()` can be slow, and that's by design. Instead, do `count()` with a `WHERE __time ...` filter and you'll have a speedier application.
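+1. A pair of example payloads could make the contrast concrete (datasource name from the tutorial; the actual speedup depends on segment layout):

```python
import json

# An unfiltered count must scan every segment of the datasource:
slow_count = {"query": "SELECT COUNT(*) FROM wikipedia_api"}

# Filtering on __time lets Druid prune segments outside the interval,
# because data is partitioned by __time by default:
fast_count = {
    "query": (
        "SELECT COUNT(*) FROM wikipedia_api "
        "WHERE __time >= TIMESTAMP '2016-06-27' "
        "AND __time < TIMESTAMP '2016-06-28'"
    )
}

for q in (slow_count, fast_count):
    print(json.dumps(q))
```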
| "\"query\": \"INSERT INTO wikipedia_api SELECT TIME_PARSE(\\\"timestamp\\\") \\\n", | ||
| " AS __time, * FROM TABLE \\\n", | ||
| " (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\": [\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\": \\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"user\\\", \\\"type\\\": \\\"string\\\"}]')) \\\n", | ||
| " PARTITIONED BY DAY\",\n", |
I think we should explain what "PARTITIONED BY DAY" means. Specifically: if you partition by day, Druid creates segment files within each partition, and you can only replace, delete, and update data at the partition level. You'll want to keep each partition large enough (at least 500,000 rows) for good performance, but not so big that those operations become impractical.
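Agreed. A short cell could also make the granularity choice explicit. This is a sketch with the `EXTERN(...)` call shortened, not the tutorial's full statement:

```python
import json

# PARTITIONED BY sets the time granularity of the partitions Druid creates;
# replace, delete, and update operations then work at that partition level.
# DAY is what the tutorial uses; HOUR or MONTH are other common choices,
# depending on how many rows land in each partition.
granularity = "DAY"

query = (
    "INSERT INTO wikipedia_api "
    'SELECT TIME_PARSE("timestamp") AS __time, * '
    "FROM TABLE(EXTERN(...)) "  # shortened; the full EXTERN call is above
    f"PARTITIONED BY {granularity}"
)
print(json.dumps({"query": query}))
```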
* docs: notebook for API tutorial
* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* address the other comments
* typo
* add commentary to outputs
* address feedback from will
* delete unnecessary comment

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
This breaks up #13342 into 2 PRs if we want to take that route. It contains only the notebook and .gitignore. If this gets merged before 13342, the download links in that PR should work.
This PR has: