docs: notebook only for API tutorial #13345

Merged
techdocsmith merged 7 commits into apache:master from 317brian:add-api-tutorial-nb-only
Dec 15, 2022

Conversation

@317brian
Contributor

This breaks up #13342 into 2 PRs if we want to take that route. It contains only the notebook and .gitignore. If this gets merged before 13342, the download links in that PR should work.

This PR has:

  • been self-reviewed.

Contributor

@techdocsmith techdocsmith left a comment

This is a great introduction! Thank you. Minor suggestions here and there.

"- Querying data\n",
"- Deleting data\n",
"\n",
"In a Druid deployment, you have [Mastery, Query, and Data servers](https://druid.apache.org/docs/latest/design/processes.html#server-types) that all fulfill different purposes. The endpoint you use for a certain action is determined, partially, by which server governs that part of Druid and the processes that run on that server type. That's why the [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html#historical) is organized by server type and process.\n",
Contributor

Could we pare this down a little more and make it specific to the APIs in this tutorial? Something along these lines:

Different Druid server types are responsible for handling different APIs for various services. This tutorial introduces x, y, and z. For more information, see [link to APIs] organized by server type.

Also typo: "Mastery"

"- [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html)\n",
"- [API reference](https://druid.apache.org/docs/latest/operations/api-reference.html)\n",
"\n",
"You can also try out the [druid-client](https://github.com/paul-rogers/druid-client), a Python library for Druid created by a Druid contributor.\n",
Contributor

I think we should mention the "Druid contributor", Paul Rogers, explicitly here.

@317brian 317brian changed the title docs: notebook for API tutorial docs: notebook only for API tutorial Nov 22, 2022
@techdocsmith
Contributor

Suggest shortening:
print(json.dumps(json.loads(response.text), indent=4))
to
print(json.dumps(response.json(), indent=4))
where applicable
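
For a JSON response body, `response.json()` returns the same parsed object as `json.loads(response.text)`, so both print statements produce the same output. A minimal sketch of that equivalence, using a hypothetical stand-in class (`FakeResponse` is not part of `requests`) and a made-up body so that it runs without a Druid instance:

```python
import json


class FakeResponse:
    """Hypothetical stand-in for requests.Response; mimics only .text and .json()."""

    def __init__(self, text):
        self.text = text

    def json(self):
        # requests.Response.json() likewise parses the response body as JSON.
        return json.loads(self.text)


# Made-up response body, for illustration only.
response = FakeResponse('{"status": "ok", "rows": [1, 2]}')

long_form = json.dumps(json.loads(response.text), indent=4)
short_form = json.dumps(response.json(), indent=4)
print(short_form)
```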

317brian and others added 3 commits November 22, 2022 15:12
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
@317brian 317brian requested review from techdocsmith and removed request for paul-rogers November 28, 2022 20:32
Contributor

@2bethere 2bethere left a comment

Awesome tutorial! I cannot wait for this!
This is definitely going to be one of the highlights of the next release containing it.

Thanks again for putting this together.

"import requests\n",
"import json\n",
"\n",
"# druid_host is the hostname and port for your Druid deployment. \n",
Contributor

Suggested change
"# druid_host is the hostname and port for your Druid deployment. \n",
"# druid_host is the hostname and port for your Druid deployment. \n By default, Druid runs on localhost port 8888. You can modify this to point to other Druid service locations.",

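As a sketch of what the suggested comment describes: the host lives in one variable, and each request appends an endpoint path to it. The helper name `build_url` and the `druid.example.com` host are hypothetical; `http://localhost:8888` is the default named in the suggestion, and the two endpoint paths are assumed to be the ones the tutorial calls.

```python
# Default location from the suggested comment; change this to point at
# another Druid service if your deployment differs.
druid_host = "http://localhost:8888"


def build_url(host, endpoint):
    """Join a host and an endpoint path without doubling or dropping the slash."""
    return host.rstrip("/") + "/" + endpoint.lstrip("/")


print(build_url(druid_host, "/status"))
print(build_url("http://druid.example.com:8888/", "druid/v2/sql"))
```

Keeping the host in a single variable means that pointing the notebook at a different deployment is a one-line change.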
"\n",
"\n",
"payload = json.dumps({\n",
"\"query\": \"INSERT INTO wikipedia_api SELECT TIME_PARSE(\\\"timestamp\\\") \\\n",
Contributor

One scenario we need to account for is a failed ingest, or users who are continuing this tutorial. Maybe add a comment highlighting that if you see that the datasource already exists, you should use "REPLACE INTO" instead.
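
One way to act on this, sketched under assumptions: build the statement with either `INSERT INTO` or `REPLACE INTO ... OVERWRITE ALL`, depending on whether the datasource already exists. The function name `ingest_statement`, the `datasource_exists` flag, and the shortened `SELECT` (with its placeholder `source_table`) are illustrative only; the notebook's real query uses the full `EXTERN` definition.

```python
import json


def ingest_statement(datasource, select_sql, datasource_exists):
    """Return an ingestion statement that can be rerun.

    INSERT appends to a datasource; REPLACE ... OVERWRITE ALL replaces
    whatever a previous (possibly failed or repeated) run left behind.
    """
    if datasource_exists:
        head = f"REPLACE INTO {datasource} OVERWRITE ALL"
    else:
        head = f"INSERT INTO {datasource}"
    return f"{head} {select_sql} PARTITIONED BY DAY"


# Placeholder SELECT; stands in for the EXTERN-based query in the notebook.
select_sql = 'SELECT TIME_PARSE("timestamp") AS __time, * FROM source_table'

payload = json.dumps({"query": ingest_statement("wikipedia_api", select_sql, True)})
print(payload)
```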

"tags": []
},
"source": [
"### Query your data\n",
Contributor

This seems to be a repeat of the title "Query your data" above?

"Druid supports dynamic parameters, so you can either define certain parameters within the query explicitly or insert a `?` as a placeholder and define it in a parameters block. In the following cell, the `?` gets bound to the timestamp value of `2016-06-27` at execution time. For more information, see [Dynamic parameters](https://druid.apache.org/docs/latest/querying/sql.html#dynamic-parameters).\n",
"\n",
"\n",
"The following cell selects rows where the `__time` column contains a value greater than the value defined dynamically in `parameters` and sets a custom `sqlQueryId`."
Contributor

I think we need a section to talk about what the heck is __time and why it matters to people.

This is a super unique Druid concept that most other databases don't have.

I think the important message here is:

  1. Big datasets almost always have a time associated with them, for example event data.
  2. Druid partitions your data by __time by default. This speeds up queries if you use a WHERE __time... filter. When you are designing applications, you should think about this.
  3. Unfiltered queries like count() can be slow; that's by design. Instead, do count() where __time... and you'll have a speedier application.
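
The third point can be sketched as a request body that counts rows only within a bounded `__time` range, with both bounds supplied as dynamic parameters. The helper name `count_between` is hypothetical, and the `TIMESTAMP` parameter type with date-only string values is an assumption based on the notebook text, which binds `?` to `2016-06-27`.

```python
import json


def count_between(datasource, start, end):
    """Build a SQL API request body that counts rows in [start, end) by __time."""
    return {
        # Each ? is bound, in order, to an entry in "parameters".
        "query": (
            f"SELECT COUNT(*) AS row_count FROM {datasource} "
            "WHERE __time >= ? AND __time < ?"
        ),
        "parameters": [
            {"type": "TIMESTAMP", "value": start},
            {"type": "TIMESTAMP", "value": end},
        ],
    }


body = count_between("wikipedia_api", "2016-06-27", "2016-06-28")
payload = json.dumps(body)
print(payload)
```

The resulting `payload` is a JSON string of the kind the notebook posts to the SQL endpoint.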

"\"query\": \"INSERT INTO wikipedia_api SELECT TIME_PARSE(\\\"timestamp\\\") \\\n",
" AS __time, * FROM TABLE \\\n",
" (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\": [\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\": \\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"user\\\", \\\"type\\\": \\\"string\\\"}]')) \\\n",
" PARTITIONED BY DAY\",\n",
Contributor

I think we should explain what "PARTITIONED BY DAY" means. Specifically, if you partition by day, Druid creates segment files within each partition, and you can only replace, delete, and update data at the partition level. You'll want to keep each partition large enough (min 500,000 rows) for good performance, but not so big that those operations become impossible.
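
To make the idea concrete, here is a conceptual sketch of how rows fall into day-sized partitions by timestamp. It illustrates the grouping only and is not how Druid assigns rows internally; the function name `day_bucket` and the sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime


def day_bucket(timestamp):
    """Return the calendar day (YYYY-MM-DD) that a UTC ISO 8601 timestamp falls in."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.strftime("%Y-%m-%d")


# Made-up event timestamps; two land on one day and one on the next.
timestamps = [
    "2016-06-27T00:00:11.080Z",
    "2016-06-27T23:59:59.000Z",
    "2016-06-28T00:00:01.500Z",
]

rows_per_partition = Counter(day_bucket(ts) for ts in timestamps)
print(dict(rows_per_partition))
```

Counting rows per bucket like this is a quick way to see whether a given granularity gives partitions near the size the comment recommends.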

@317brian 317brian requested review from 2bethere and removed request for techdocsmith December 7, 2022 18:46
Contributor

@techdocsmith techdocsmith left a comment

LGTM

@techdocsmith techdocsmith merged commit 668d1fa into apache:master Dec 15, 2022
@techdocsmith techdocsmith deleted the add-api-tutorial-nb-only branch December 15, 2022 21:16
vtlim pushed a commit to vtlim/druid that referenced this pull request Dec 16, 2022
* docs: notebook for API tutorial

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* address the other comments

* typo

* add commentary to outputs

* address feedback from will

* delete unnecessary comment

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
@kfaraz kfaraz added this to the 25.0 milestone Dec 17, 2022
kfaraz pushed a commit that referenced this pull request Dec 17, 2022
* docs: add index page and related stuff for jupyter tutorials (#13342)

* docs: notebook only for API tutorial (#13345)

* docs: notebook for API tutorial

* Apply suggestions from code review

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

* address the other comments

* typo

* add commentary to outputs

* address feedback from will

* delete unnecessary comment

Co-authored-by: Charles Smith <techdocsmith@gmail.com>

Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>

4 participants