docs: notebook only for API tutorial #13345
Conversation
techdocsmith
left a comment
There was a problem hiding this comment.
This is a great introduction! Thank you. Minor suggestions here and there.
| "- Querying data\n", | ||
| "- Deleting data\n", | ||
| "\n", | ||
| "In a Druid deployment, you have [Mastery, Query, and Data servers](https://druid.apache.org/docs/latest/design/processes.html#server-types) that all fulfill different purposes. The endpoint you use for a certain action is determined, partially, by which server governs that part of Druid and the processes that run on that server type. That's why the [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html#historical) is organized by server type and process.\n", |
Could we pare this down a little more and make it specific to the APIs in this tutorial? Along the lines of:
Different Druid server types are responsible for handling different APIs for various services. This tutorial introduces x, y, and z. For more information, see [link to APIs], organized by server type.
Also typo: "Mastery"
| "- [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html)\n", | ||
| "- [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html)\n", | ||
| "\n", | ||
| "You can also try out the [druid-client](https://github.com/paul-rogers/druid-client), a Python library for Druid created by a Druid contributor.\n", |
I think we should mention the "Druid contributor", Paul Rogers, explicitly here.
Suggest shortening:
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2bethere
left a comment
Awesome tutorial! I cannot wait for this!
This is definitely going to be one of the highlights of the next release containing it.
Thanks again for putting this together.
| "import requests\n", | ||
| "import json\n", | ||
| "\n", | ||
| "# druid_host is the hostname and port for your Druid deployment. \n", |
| "# druid_host is the hostname and port for your Druid deployment. \n", | |
| "# druid_host is the hostname and port for your Druid deployment. \n By default, Druid runs on localhost port 8888. You can modify this to point to other Druid service locations.", |
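For what it's worth, the expanded comment could sit in a cell along these lines (the `/status` health check is my suggestion, not part of this diff):

```python
# druid_host is the hostname and port for your Druid deployment.
# By default, Druid runs on localhost port 8888. You can modify
# this to point to other Druid service locations.
druid_host = "http://localhost:8888"

# Build the endpoint URL once so later cells can reuse it.
status_endpoint = druid_host + "/status"
print(status_endpoint)

# Uncomment to verify that the deployment is reachable:
# import requests
# response = requests.get(status_endpoint)
# print(response.json())
```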
| "\n", | ||
| "\n", | ||
| "payload = json.dumps({\n", | ||
| "\"query\": \"INSERT INTO wikipedia_api SELECT TIME_PARSE(\\\"timestamp\\\") \\\n", |
One of the scenarios we need to account for is a failed ingest, or users rerunning this tutorial. Maybe add a comment highlighting that if the datasource already exists, you should use "REPLACE INTO" instead.
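To make that concrete, the cell could switch verbs based on whether the datasource exists. This is only a sketch: the `datasource_exists` flag and the shortened `EXTERN(...)` call are placeholders, not the tutorial's actual code.

```python
import json

# If a previous run already created the datasource (for example after a
# failed or repeated ingest), INSERT INTO will append duplicate rows.
# REPLACE ... OVERWRITE ALL rewrites the datasource instead.
datasource_exists = True  # placeholder; e.g. check GET /druid/coordinator/v1/datasources

verb = ("REPLACE INTO wikipedia_api OVERWRITE ALL"
        if datasource_exists
        else "INSERT INTO wikipedia_api")

query = (
    f"{verb} "
    'SELECT TIME_PARSE("timestamp") AS __time, * '
    "FROM TABLE(EXTERN(...)) "  # shortened; the full EXTERN call is in the notebook
    "PARTITIONED BY DAY"
)
payload = json.dumps({"query": query})
print(payload)
```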
| "tags": [] | ||
| }, | ||
| "source": [ | ||
| "### Query your data\n", |
This seems to be a repeat of the title "Query your data" above?
| "Druid supports dynamic parameters, so you can either define certain parameters within the query explicitly or insert a `?` as a placeholder and define it in a parameters block. In the following cell, the `?` gets bound to the timestamp value of `2016-06-27` at execution time. For more information, see [Dynamic parameters](https://druid.apache.org/docs/latest/querying/sql.html#dynamic-parameters).\n", | ||
| "\n", | ||
| "\n", | ||
| "The following cell selects rows where the `__time` column contains a value greater than the value defined dynamically in `parameters` and sets a custom `sqlQueryId`." |
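For readers skimming the review, the payload being described looks roughly like this (the table name, timestamp value, and query ID are illustrative):

```python
import json

# Sketch of a parameterized SQL payload: the "?" placeholder is bound to
# the TIMESTAMP value in the parameters block at execution time, and
# context.sqlQueryId sets a custom query ID.
payload = json.dumps({
    "query": "SELECT * FROM wikipedia_api WHERE __time > ? LIMIT 3",
    "parameters": [
        {"type": "TIMESTAMP", "value": "2016-06-27 00:00:00"}
    ],
    "context": {"sqlQueryId": "important-query"}
})
print(payload)

# POST this to {druid_host}/druid/v2/sql, e.g. with requests:
# requests.post(f"{druid_host}/druid/v2/sql", data=payload,
#               headers={"Content-Type": "application/json"})
```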
I think we need a section that talks about what `__time` is and why it matters to people.
This is a very distinctive Druid concept that most other databases don't have.
I think the important message here is:
- Big datasets almost always have time associated with them, for example event data.
- Druid partitions your data by `__time` by default. This speeds up queries that use a `WHERE __time ...` filter. Keep this in mind when you are designing applications.
- Traditional queries like `count()` can be slow, and that's by design. Instead, do `count()` with a `WHERE __time ...` filter and you'll have a speedier application.
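+1. A pair of example payloads could make the contrast concrete (datasource name from the tutorial; the actual speedup depends on segment layout):

```python
import json

# An unfiltered count must scan every segment of the datasource:
slow_count = {"query": "SELECT COUNT(*) FROM wikipedia_api"}

# Filtering on __time lets Druid prune segments outside the interval,
# because data is partitioned by __time by default:
fast_count = {
    "query": (
        "SELECT COUNT(*) FROM wikipedia_api "
        "WHERE __time >= TIMESTAMP '2016-06-27' "
        "AND __time < TIMESTAMP '2016-06-28'"
    )
}

for q in (slow_count, fast_count):
    print(json.dumps(q))
```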
| "\"query\": \"INSERT INTO wikipedia_api SELECT TIME_PARSE(\\\"timestamp\\\") \\\n", | ||
| " AS __time, * FROM TABLE \\\n", | ||
| " (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\": [\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\": \\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"user\\\", \\\"type\\\": \\\"string\\\"}]')) \\\n", | ||
| " PARTITIONED BY DAY\",\n", |
I think we should explain what "PARTITIONED BY DAY" means. Specifically: if you partition by day, Druid creates segment files within each partition, and you can only replace, delete, and update data at the partition level. You'll want to keep each partition large enough (at least 500,000 rows) for good performance, but not so big that those operations become impractical.
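Agreed. A short cell could also make the granularity choice explicit. This is a sketch with the `EXTERN(...)` call shortened, not the tutorial's full statement:

```python
import json

# PARTITIONED BY sets the time granularity of the partitions Druid creates;
# replace, delete, and update operations then work at that partition level.
# DAY is what the tutorial uses; HOUR or MONTH are other common choices,
# depending on how many rows land in each partition.
granularity = "DAY"

query = (
    "INSERT INTO wikipedia_api "
    'SELECT TIME_PARSE("timestamp") AS __time, * '
    "FROM TABLE(EXTERN(...)) "  # shortened; the full EXTERN call is above
    f"PARTITIONED BY {granularity}"
)
print(json.dumps({"query": query}))
```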
* docs: notebook for API tutorial
* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* address the other comments
* typo
* add commentary to outputs
* address feedback from will
* delete unnecessary comment

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
This breaks up #13342 into 2 PRs if we want to take that route. It contains only the notebook and .gitignore. If this gets merged before 13342, the download links in that PR should work.
This PR has: