msq: add multi-stage-query docs #12983
Conversation
add back theta sketches tutorial; change filename; fix filename; fix link; fix headings
573436e to 748ab34
@2bethere @gianm @vogievetsky please review the OSS docs as a whole for MSQ. @cryptoe please review the known issues list to see if there's anything to be added or removed: https://github.com/apache/druid/pull/12983/files#diff-e83f8a116cd5e34642cfe3474e915dd586d4679547ac393a6dea654a78c9bbec
### PARTITIONED BY
INSERT and REPLACE queries require the PARTITIONED BY clause, which determines how time-based partitioning is done. In Druid, data is split into segments, one or more per time chunk defined by the PARTITIONED BY granularity. A good general rule is to adjust the granularity so that each segment contains about five million rows. Choose a granularity based on your ingestion rate. For example, if you ingest a million rows per day, PARTITIONED BY DAY is good. If you ingest a million rows an hour, choose PARTITIONED BY HOUR instead.
Should we make 5 million bold?
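For reference, a minimal sketch of the clause being discussed; the datasource name, source URI, and columns below are hypothetical, not from the PR:

```sql
-- Hypothetical example: ingesting ~1M rows/day, so PARTITIONED BY DAY
-- keeps segments near the suggested five-million-row target.
INSERT INTO "wikipedia_events"
SELECT
  TIME_PARSE("timestamp") AS __time,
  "channel",
  "page"
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["https://example.com/wikipedia.json.gz"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"}, {"name": "channel", "type": "string"}, {"name": "page", "type": "string"}]'
  )
)
PARTITIONED BY DAY
```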
- EXTERN does not accept `druid` input sources.
## Missing guardrails
Maximum number of input files. No guardrail today means the controller can potentially run out of memory tracking them all.
> SQL-based ingestion using the multi-stage query task engine is our recommended solution starting in Druid 24.0. Alternative ingestion solutions, such as native batch and Hadoop-based ingestion systems, will still be supported. We recommend you read all [known issues](./msq-known-issues.md) and test the feature in a development environment before rolling it out in production. Using the multi-stage query task engine with `SELECT` statements that do not write to a datasource is experimental.
By default, the multi-stage query task engine (MSQ task engine) uses the local storage of a node to store data from intermediate steps when executing a query. Although this method provides better speed when executing a query, the data is lost if the node encounters an issue. When you enable durable storage, intermediate data is stored in Amazon S3 instead. Using this feature can improve the reliability of queries that use more than 20 workers. In essence, you trade some performance for better reliability. This is especially useful for long-running queries.
Where does the "20 workers" language come from?
3. A row signature, as a JSON-encoded array of column descriptors. Each column descriptor must have a `name` and a `type`. The type can be `string`, `long`, `double`, or `float`. This row signature is used to map the external data into the SQL layer.
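For illustration, a row signature with one column of each description style might look like the following (the column names are hypothetical):

```json
[
  {"name": "timestamp", "type": "string"},
  {"name": "user_id", "type": "long"},
  {"name": "score", "type": "double"}
]
```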
### INSERT
I think somewhere here we should call out a note about how the MSQ INSERT syntax (REPLACE also, but this has more impact for INSERT) deviates from the SQL standard in that the columns are mapped by name, not positionally.
Maybe something like:
Please note that unlike standard SQL, data is inserted according to column name, not position. This means it is important for the output column names of subsequent inserts to match the table's column names, rather than simply relying on their positions within the SELECT clause.
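A quick sketch of the by-name behavior described in that comment; the table, source, and column names are hypothetical:

```sql
-- Target datasource columns: __time, city, sessions.
-- MSQ matches columns by NAME, not position, so the aliases below
-- determine the mapping; the order within SELECT is irrelevant.
INSERT INTO "site_metrics"
SELECT
  "num_sessions"   AS "sessions",  -- matched to "sessions" by name
  "location"       AS "city",      -- matched to "city" by name
  TIME_PARSE("ts") AS __time
FROM "raw_site_data"
PARTITIONED BY DAY
```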
|Field|Description|
|-----|-----------|
| taskId | Controller task ID. You can use Druid's standard [task APIs](../operations/api-reference.md#overlord) to interact with this controller task.|
Is there a strange indent here, or is the GitHub UI buggy?
Currently, the MSQ task engine ignores the provided values of `resultFormat`, `header`, `typesHeader`, and `sqlTypesHeader`. SQL SELECT queries always behave as if `resultFormat` is an array, `header` is true, `typesHeader` is true, and `sqlTypesHeader` is true.
I would change
SQL SELECT queries always behave as if `resultFormat` is an array, `header` is
true, `typesHeader` is true, and `sqlTypesHeader` is true.
To
SQL SELECT queries write out their results into the task report (in the `multiStageQuery.payload.results.results` key) formatted as if `resultFormat` is an `array`.
finish sentence
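For context, a rough sketch of where those results would land in the task report, inferred only from the `multiStageQuery.payload.results.results` key path mentioned above; the surrounding fields and row values are illustrative, not a definitive schema:

```json
{
  "multiStageQuery": {
    "payload": {
      "results": {
        "results": [
          ["2016-06-27T00:00:00.000Z", "en", 123]
        ]
      }
    }
  }
}
```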
To interact with a query through the Overlord API, you need the following permissions:
- INSERT or REPLACE queries: You must have READ DATASOURCE permission on the output datasource.
- SELECT queries: You must have read permissions on the `__query_select` datasource, which is a stub datasource that gets created.
This should read ... that does not get created.
| "ingestion/tasks", | ||
| "ingestion/faq" | ||
| ], | ||
| "SQL-based ingestion": [ |
This needs two spaces of leading indent.
@317brian - You need to add the copyright header in text files. |
* msq: add multi-stage-query docs
* add screenshots; add back theta sketches tutorial; change filename; fix filename; fix link; fix headings
* fixes
* fixes
* fix spelling issues and update spell file
* address feedback from karan
* add missing guardrail to known issues
* update blurb
* fix typo
* remove durable storage info
* update titles
* Restore en.json
* Update query view
* address comments from vad
* Update docs/multi-stage-query/msq-known-issues.md: finish sentence
* add apache license to docs
* add apache license to docs

Co-authored-by: Katya Macedo <katya.macedo@imply.io>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>

This PR adds all the documentation updates for the multi-stage query architecture and the MSQ task engine.
Note that the screenshot on the Druid console page is outdated and will get updated in a subsequent PR.
This PR has: