msq: add multi-stage-query docs #12983

Merged
abhishekagarwal87 merged 18 commits into apache:master from 317brian:merge-msq-docs
Sep 6, 2022

@317brian
Contributor

This PR adds all the documentation updates for the multi-stage query architecture and the MSQ task engine.

Note that the screenshot on the Druid console page is outdated and will get updated in a subsequent PR.

This PR has:

  • been self-reviewed.

add back theta sketches tutoria

change filename

fix filename

fix link

fix headings
@317brian
Contributor Author

317brian commented Aug 26, 2022

@2bethere @gianm @vogievetsky please review the OSS docs as a whole for MSQ

@cryptoe please review the known issues list to see if there's anything to be added or removed: https://github.com/apache/druid/pull/12983/files#diff-e83f8a116cd5e34642cfe3474e915dd586d4679547ac393a6dea654a78c9bbec

@abhishekagarwal87 added this to the 24.0.0 milestone Aug 29, 2022
Contributor

@cryptoe left a comment

LGTM


### PARTITIONED BY

INSERT and REPLACE queries require the PARTITIONED BY clause, which determines how time-based partitioning is done. In Druid, data is split into segments, one or more per time chunk defined by the PARTITIONED BY granularity. A good general rule is to adjust the granularity so that each segment contains about five million rows. Choose a granularity based on your ingestion rate. For example, if you ingest a million rows per day, PARTITIONED BY DAY is good. If you ingest a million rows an hour, choose PARTITIONED BY HOUR instead.
Contributor

Should we make 5 million bold?
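
For illustration only, the sizing rule quoted above can be turned into a small calculation. This is a minimal sketch in Python, not part of Druid: the function name `suggest_granularity`, the candidate granularity list, and the approximate chunk lengths are all assumptions made here. It also simplifies the rule by treating each time chunk as a single segment, whereas the quoted text notes a time chunk can hold more than one.

```python
# Approximate hours per time chunk for a few PARTITIONED BY granularities.
# The candidate list and the 30-day month are simplifications for this sketch.
HOURS_PER_CHUNK = {"HOUR": 1, "DAY": 24, "MONTH": 24 * 30, "YEAR": 24 * 365}


def suggest_granularity(rows_per_hour, target_rows=5_000_000):
    """Return the coarsest granularity whose time chunk stays at or below target_rows.

    Falls back to HOUR (the finest candidate here) when even an hourly
    chunk would exceed the target.
    """
    best = "HOUR"
    # Walk from the finest to the coarsest granularity.
    for name, hours in sorted(HOURS_PER_CHUNK.items(), key=lambda kv: kv[1]):
        if rows_per_hour * hours <= target_rows:
            best = name
    return best


# The two ingestion rates mentioned in the quoted paragraph.
print(suggest_granularity(1_000_000 / 24))  # a million rows per day
print(suggest_granularity(1_000_000))       # a million rows per hour
```

Under these assumptions the sketch agrees with the examples in the quoted paragraph: a million rows per day lands on DAY and a million rows per hour lands on HOUR.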

- EXTERN does not accept `druid` input sources.

## Missing guardrails

Contributor

Maximum number of input files. No guardrail today means the controller can potentially run out of memory tracking them all.

Comment thread docs/multi-stage-query/msq-tutorial-connect-external-data.md Outdated
@gianm added the "Area - MSQ" label (For multi stage queries - https://github.com/apache/druid/issues/12262) Aug 29, 2022

> SQL-based ingestion using the multi-stage query task engine is our recommended solution starting in Druid 24.0. Alternative ingestion solutions, such as native batch and Hadoop-based ingestion systems, will still be supported. We recommend you read all [known issues](./msq-known-issues.md) and test the feature in a development environment before rolling it out in production. Using the multi-stage query task engine with `SELECT` statements that do not write to a datasource is experimental.

By default, the multi-stage query task engine (MSQ task engine) uses the local storage of a node to store data from intermediate steps when executing a query. Although this method provides better speed when executing a query, the data is lost if the node encounters an issue. When you enable durable storage, intermediate data is stored in Amazon S3 instead. Using this feature can improve the reliability of queries that use more than 20 workers. In essence, you trade some performance for better reliability. This is especially useful for long-running queries.
Contributor

Where does the "20 workers" language come from?

3. A row signature, as a JSON-encoded array of column descriptors. Each column descriptor must have a `name` and a `type`. The type can be `string`, `long`, `double`, or `float`. This row signature is used to map the external data into the SQL layer.
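
As a rough illustration of the row signature described in item 3, the following Python sketch checks that a JSON string is an array of column descriptors with a `name` and one of the four listed types. The helper name `validate_row_signature` and the sample column names are hypothetical; Druid performs its own validation, which this does not reproduce.

```python
import json

# Types listed in the quoted documentation for column descriptors.
ALLOWED_TYPES = {"string", "long", "double", "float"}


def validate_row_signature(signature_json):
    """Parse a JSON-encoded row signature and check each column descriptor."""
    columns = json.loads(signature_json)
    if not isinstance(columns, list):
        raise ValueError("row signature must be a JSON array")
    for col in columns:
        if not isinstance(col, dict) or "name" not in col or "type" not in col:
            raise ValueError(f"descriptor needs 'name' and 'type': {col!r}")
        if col["type"] not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type {col['type']!r} for {col['name']!r}")
    return columns


# Hypothetical signature with two columns.
signature = '[{"name": "page", "type": "string"}, {"name": "added", "type": "long"}]'
print(validate_row_signature(signature))
```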

### INSERT

Contributor

I think somewhere here we should call out a note about how the MSQ INSERT syntax (REPLACE also, but this has more impact for INSERT) deviates from the SQL standard in that the columns are mapped by name and not positionally.

Maybe something like:

Please note that, unlike standard SQL, the data is inserted according to column name and not positionally. This means that it is important for the output column names of subsequent inserts to match those of the table, rather than to simply rely on their positions within the SELECT clause.
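
To make the difference concrete, here is a small Python simulation of the two mapping strategies. It does not call Druid; the table columns, SELECT output names, and row values are invented for illustration, and the sketch says nothing about how Druid treats missing or extra columns.

```python
def map_by_name(table_columns, select_columns, row):
    """Assign each value to the table column with the same name as its SELECT output column."""
    values = dict(zip(select_columns, row))
    return {col: values[col] for col in table_columns}


def map_by_position(table_columns, row):
    """Assign values to table columns purely by their order (standard SQL INSERT behavior)."""
    return dict(zip(table_columns, row))


# Hypothetical table layout and a SELECT whose outputs are in a different order.
table_columns = ["__time", "page", "added"]
select_columns = ["__time", "added", "page"]
row = ("2022-09-06T00:00:00Z", 17, "Main_Page")

by_name = map_by_name(table_columns, select_columns, row)
by_position = map_by_position(table_columns, row)

print(by_name)      # page and added receive the intended values
print(by_position)  # page and added are swapped
```

With name-based mapping the reordered SELECT is harmless, but a misnamed output column would not line up with the table, which is the point the suggested note makes.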

Comment thread docs/multi-stage-query/msq-api.md Outdated

|Field|Description|
|-----|-----------|
| taskId | Controller task ID. You can use Druid's standard [task APIs](../operations/api-reference.md#overlord) to interact with this controller task.|
Contributor

Is there a strange indent here, or is the GitHub UI buggy?

Comment thread docs/multi-stage-query/msq-api.md Outdated

Currently, the MSQ task engine ignores the provided values of `resultFormat`, `header`,
`typesHeader`, and `sqlTypesHeader`. SQL SELECT queries always behave as if `resultFormat` is an array, `header` is
true, `typesHeader` is true, and `sqlTypesHeader` is true.
Contributor

I would change

SQL SELECT queries always behave as if `resultFormat` is an array, `header` is
true, `typesHeader` is true, and `sqlTypesHeader` is true.

To

SQL SELECT queries write out their results into the task report (in the `multiStageQuery.payload.results.results` key) formatted as if `resultFormat` is an `array`.
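
As a sketch of what reading those results could look like, the Python snippet below walks the key path named in the suggestion (`multiStageQuery.payload.results.results`). Only that key path comes from the comment above; the helper name `extract_select_results` and the sample report contents are made up, and a real task report would contain other fields and would first have to be fetched from the task API.

```python
def extract_select_results(report):
    """Return the result rows stored under multiStageQuery.payload.results.results."""
    return report["multiStageQuery"]["payload"]["results"]["results"]


# Hypothetical, heavily trimmed report; each row is an array of values,
# matching the "formatted as if resultFormat is an array" wording.
sample_report = {
    "multiStageQuery": {
        "payload": {
            "results": {
                "results": [["Main_Page", 17], ["Talk", 3]],
            }
        }
    }
}

rows = extract_select_results(sample_report)
for row in rows:
    print(row)
```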

@techdocsmith mentioned this pull request Aug 30, 2022
1 task
@317brian requested a review from vogievetsky August 31, 2022 15:40
Contributor

@techdocsmith left a comment

LGTM

Comment thread docs/multi-stage-query/msq-known-issues.md Outdated
To interact with a query through the Overlord API, you need the following permissions:

- INSERT or REPLACE queries: You must have READ DATASOURCE permission on the output datasource.
- SELECT queries: You must have read permissions on the `__query_select` datasource, which is a stub datasource that gets created.
Contributor

This should read ... that does not get created.

Comment thread website/sidebars.json
"ingestion/tasks",
"ingestion/faq"
],
"SQL-based ingestion": [
Contributor

this needs two spaces of leading indent

@317brian
Contributor Author

317brian commented Sep 6, 2022

The CI is failing on the licensing check and JDK 8 packaging check:
image

@abhishekagarwal87
Contributor

@317brian - You need to add the copyright header in text files.

@abhishekagarwal87 merged commit d4233ef into apache:master Sep 6, 2022
abhishekagarwal87 pushed a commit that referenced this pull request Sep 6, 2022
* msq: add multi-stage-query docs

* add screenshots

add back theta sketches tutoria

change filename

fix filename

fix link

fix headings

* fixes

* fixes

* fix spelling issues and update spell file

* address feedback from karan

* add missing guardrail to known issues

* update blurb

* fix typo

* remove durable storage info

* update titles

* Restore en.json

* Update query view

* address comments from vad

* Update docs/multi-stage-query/msq-known-issues.md

finish sentence

* add apache license to docs

* add apache license to docs

Co-authored-by: Katya Macedo <katya.macedo@imply.io>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Labels

Area - Documentation Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262

8 participants